Testing for Teachers 


Henry E. Garrett 


Professor Emeritus, Columbia University 


American Book Company New York 




Copyright © 1959 by

AMERICAN BOOK COMPANY
NEW YORK CINCINNATI CHICAGO ATLANTA DALLAS SAN FRANCISCO


Garrett: Testing for Teachers. Manufactured in the United States of
America. All rights reserved. No part of this book protected by the
copyright hereon may be reprinted in any form without written
permission of the publisher.


PREFACE 


This book has been written primarily for prospective teachers 
who want to know how mental tests can be of help in their school 
work. It can serve also as a guide for teachers in service. The first
seven chapters describe the varieties of mental tests and point out 
the usefulness and limitations of each sort. The last three chapters 
deal with the writing of objective items, with the construction 
of classroom tests, and with some of the ways in which mental 
tests can usefully be employed in guidance and counselling. 
Chapter 2 covers in summary fashion the statistical terms and 
procedures most often used with mental tests. I do not believe it 
possible to describe mental tests intelligently without using rele- 
vant statistical terms. At the same time, I think that the classroom 
teacher need not be a psychometrician or testing specialist in 
order to use standard tests in the school. For those who want to 
go further into test construction, there is an Appendix which 
treats statistical method more fully. 

I have found it generally better to teach Chapter 2 before 
taking up a discussion of mental tests themselves—to use it, that 
is, as a preliminary to later chapters. Chapter 2 can then be re- 
ferred to specifically when the various statistical terms occur. 
This procedure has the advantage of reviewing the basic statis- 
tics when the need arises. 

I believe that the book will be found to contain ample material 
for one term’s work. This is especially true when the laboratory 
exercises and questions at the ends of the chapters are covered in 
class discussion, and when reports upon relevant literature are
required.

Henry E. Garrett


CONTENTS 


1. Mental Tests in the Schools 1 
2. Statistics in Mental Testing 13 
3. Individual Intelligence Scales 44 
4. Group Tests of Intelligence 80 
5. Educational Achievement Tests 102 
6. Aptitude Tests 131 
7. Personality Tests 157 
8. Objective-Test Items and Short-Answer Techniques 184 
9. Constructing the Objective Test 210 
10. Some Problems in the Evaluation of Test Scores 227 


Appendix A 
Statistical Supplement 239 


Appendix B 
Publishers of Mental Tests 253 
Glossary 255 


Author Index 259 
Subject Index 260 


List of Tables 
2-1. Frequency Distribution of Fifty Scores on a Test of 
English Grammar 14 
2-2. Frequency Distribution Showing Cumulated Frequencies 
on an English Grammar Test 25 
2-3. Frequency Distribution of 180 Scores Achieved on a 
Clerical Aptitude Test 34 
2-4. M’s and σ’s Earned on Five Objective Tests of Educational
Achievement Given in the Sixth Grade 36
3-1. Illustrative Tests for Stanford-Binet Scale for Years 
IV, X, XIV, and Average Adult 48-49 
3-2. Numbers of Children in the School Population to Be 
Expected at Various IQ Levels 54 
3-3. Educational Expectation in Relation to IQ Level 55
3-4. Intelligence Classification for WISC IQ’s 71
9-1. Item Analysis of the First Five Items of a Test Made 
Upon Two Criterion Groups, the Highest and the Lowest 
25 Per Cent in Total Score 222 
10-1. Differential Aptitude Scores (Ninth-Grade Class) 228 
10-2. Sociometric Tabulation 234 
A-1. Frequency Distribution of Forty Scores on a Social 
Studies Test 240 
A-2. Computation of the Mean from a Frequency 
Distribution 244 
A-3. Computation of the Median and Q from a Frequency
Distribution 246
A-4. Computation of the Standard Deviation (σ) from a
Frequency Distribution 250
A-5. Correlation between Reading and Arithmetic 
in the Fifth Grade 252 


List of Figures 
2-1. Frequency Polygon of Fifty Scores Achieved by Seventh- 
Grade Children on a Test of English Grammar 15 
2-2. Histogram of Fifty Scores Achieved by Seventh-Grade
Pupils on a Test of English Grammar 16
2-3. The Normal Curve 17
2-4. Negatively Skewed Curve 18
2-5. Positively Skewed Curve 18
2-6. Two Distributions with the Same Mean but Differing
Markedly in Range (Variability) 21
2-7. Areas Under the Normal Curve 23 
2-8. Use of Normal Curve Model to Show Distribution of Sixty 
Scores on a Reading Test 24 
2-9. Ogive or Cumulative Frequency Curve 26 
2-10. Cumulative Frequency Polygon of 180 Scores Achieved on a 
Clerical Aptitude Test 35 


2-11. Profile of the Percentile Ranks in Various Subjects for a
Given Child 35
3-1. Test Materials Used in the Stanford-Binet Scale Facing 54
3-2. Distribution of IQ’s on the Stanford-Binet Scale
for Nearly 3,000 Children, 2-18 Years Old 53
3-3. Age-Progress Curves for the Stanford-Binet Scale 62
3-4. Items from the Performance Part of the Wechsler Adult
Intelligence Scale Facing 54
3-5. Performance Tests Found in the Arthur Point 
Scale Facing 55 
4-1. Illustrated Items from the Pintner-Cunningham 
Primary Test 83 
4-2. Profile for the California Test of Mental Maturity 85 
4-3. Sample Items from the Terman-McNemar Test of Mental
Ability (Form C) 90
4-4. Norms for Various Occupational Groups on the Army
General Classification Test 98
5-1. Sample Items from the Stanford Achievement Test, Primary
Battery, Form K 107
5-2. Profile for the Metropolitan Achievement Tests 109 
5-3. Sample Items from Metropolitan Readiness Tests 120 
5-4. Sample Multiple-Choice Items from the Cooperative 
Mathematics Test for Grades 7, 8, and 9 123 
5-5. Sample Multiple-Choice Items from the Cooperative Science 
Test for Grades 7, 8, and 9 124 
6-1. Sample Items from the MacQuarrie Test of 
Mechanical Ability 135 
6-2. Samples from Bennett Mechanical Comprehension Test 136 
6-3. Sample Items from the Differential Aptitude Tests 141 
6-4. Profile of a High-School Boy on the Differential 
Aptitude Tests 143 
6-5. Sample Items from the Minnesota Paper 
Form Board Test 145 
6-6. Sample Items from the Meier Art Judgment Test Facing 150 
7-1. Sample Items from Various Graphic Rating Scales 159 
7-2. Sample Items from the California Test of Personality, 
Elementary, Grades 4-5-6-7-8, Form AA 166 
7-3. Specimen Items from a Study of Values 173
8-1. Objective-Test Items Represented by Picture, Drawing, or
Diagram 186-187
9-1. Item Analysis Data for Test File 218
10-1. Sociogram for 21 Kindergartners, 13 Boys and 8 Girls 235
A-1. Frequency Distribution of the Forty Scores in Table A-1 242
A-2. Histogram of the Forty Scores in Table A-1 243


CHAPTER 1 


MENTAL TESTS IN THE SCHOOLS 


The Teacher and Mental Tests 

The widespread use of standard tests in today’s schools ren- 
ders it increasingly necessary for the classroom teacher to be 
familiar with these devices, with what they are and what they do. 
Teachers are often required to administer and score tests and 
frequently to use these scores in the evaluation of pupil capa- 
bilities and future promise. This is essential, of course, if the 
standard test is to have value in the work of the school. Most 
teachers, however, have no desire to become testing specialists 
or psychometricians, and many have little knowledge of modern 
statistical method. For these reasons, books dealing chiefly with 
the statistics of test construction and with other technical problems,
while a necessary part of the training of school and clinical
psychologists, are often not very useful to the teacher. In fact, 
they may leave him more confused than enlightened. 

This book is planned to present a comprehensive account of 
standard tests for teachers and for others not planning to become 
specialists in this field. It is not a book on statistical method; it
does not deal broadly with the history of testing, nor with the 
applications of tests to problems of business and industry. Instead, 
it describes the various sorts of test, their uses and abuses, and how 
they supplement and aid the work of the classroom. Statistical 
terms necessary to an understanding of the tests themselves are 
defined and illustrated, but detailed calculations are not included 
in the text. The book’s usefulness will be enhanced if the exer- 
cises and topics at the ends of the chapters are carefully worked 
through. It is highly desirable, too, that the instructor have the 
class examine, take and score a number of tests. The discussion
in a chapter will be clarified when there is actual familiarity with
the tests described.


What Mental Tests Are 


In a mental test, the examinee is confronted with a variety of 
tasks: questions to be answered, problems to be solved, directions
to be followed. Answers may be given orally, in writing, and
sometimes by marking or manual manipulation, as, for example,
by fitting blocks into apertures. Mental tests differ from
physical tests, though there is considerable overlap in the two 
sorts of measurement. Both varieties of test require previous 
learning, and both present problems, but the mental test—to a 
greater degree than the physical—demands verbal abstraction 
rather than action, ideas rather than muscles. Tests of physical 
fitness—of height, weight, and physical strength, for example— 
differ most markedly from mental tests; in other words, are most 
physical. Tests which require speed and accuracy of hand-eye 
or hand-ear co-ordination, which demand manual dexterity and 
skill (called sensory-motor tests) are both mental and physical.
But none of these tests is as “mental” as is the intelligence test 
or school examination in algebra or history, since none of them 
depends to so large a degree upon verbal symbols. 

The term mental test is sometimes restricted to the measurement
of intelligence or aptitude, examinations in school subjects
being classified as educational achievement tests. The reasoning 
here is that the mental test—the intelligence test, for example— 
tells us how much a child can learn, whereas the school examina- 
tion tells us what he has already learned. To some extent this is 
true. But the distinction between the two sorts of measurement 
is one of degree rather than of kind. No mental test measures 
potential ability except by way of performance. We possess no 
microscope by which we can discover the inherited qualities of 
a child's brain or nervous system. The general intelligence test, 
to a greater degree than the school examination, measures potential
ability, because it draws more upon native alertness than upon
routine school learning. But the school examination also draws 
upon native alertness as expressed in school learning, and both 
sorts of test demand the use of symbols—words, diagrams, 
numbers, pictures. Accordingly, in this book the term mental test
will be used to describe both sorts of examination. 

The primary objective of a mental test is to detect individual 
differences—that is, to discover how one child compares or 
“stacks up” against another child of the same age, sex or grade 
classification. This knowledge, as we shall see later, is useful in 
many ways in school and out. A second objective of the mental 
test is to discover intra-individual differences or the variations 
in performance within an individual. The scores made by an 
examinee, when put in comparable units and represented on a 
profile, provide a useful record of the examinee’s strengths and
weaknesses.


A Classification of Mental Tests 


In beginning the study of mental tests, it will be helpful to 
draw up a list of the different varieties of tests. Most widely used
tests are standardized for procedure and results. A standardized
educational achievement test, for example, is one that has been 
constructed in accordance with the best principles of test making 
and has been administered to hundreds of pupils in those grades 
for which the test is suitable. Results from standard tests are 
expressed as norms. These are typical scores earned by large 
groups of children believed to be representative of various ages 
and grades. For example, a score of 45 on a standard reading test 
may be the norm for children 9 years, 6 months old; or for chil- 
dren who are just beginning the fourth grade. 

The following outline gives some notion of the field to be
covered and at the same time furnishes an overview of the
chapters to follow.

VARIETIES OF MENTAL TEST

I. Intelligence Tests
   (1) individual: administered to one examinee at a time
   (2) group: administered, like a school examination, to many
       examinees at the same time
   (3) performance: make little or no use of language, in contrast
       with the paper-and-pencil tests in (1) and (2)

II. Educational Achievement Tests
   (1) survey: comprehensive examinations used to determine
       general academic standing
   (2) subject: examinations in specific fields (for example,
       physics, Spanish)
   (3) diagnostic: cover a wide range of academic skills (in
       reading or arithmetic, for example) and are designed to
       reveal specific weaknesses and strengths

III. Aptitude Tests
   (1) general: for example, of
       (a) mechanical ability
       (b) clerical ability
   (2) special: aptitude for school subjects, for example, chemistry
       or foreign languages; differential aptitudes
   (3) professional: for example, in
       (a) law
       (b) medicine
       (c) engineering
       (d) teaching
   (4) talent: aptitude in such fields as
       (a) art
       (b) music

IV. Tests of Various Aspects of Personality
   (1) personal adjustment questionnaires: surveys of worries,
       fears, social inadequacies
   (2) attitude surveys: upon, for example, social, economic
       and political questions
   (3) inventories of interests as related to various occupations
   (4) environmental factors related to personality: questionnaires
       covering socio-economic background and other variables
   (5) projective techniques: subtle and indirect measures of
       dominant personality trends


All these mental tests will be treated in subsequent chapters. 
The following sections of this chapter provide a brief outline of 
the development of psychological testing in order to clear the 
ground for later work. For a more complete discussion of the 
historical development of mental tests, the student should consult 
references at the end of this chapter. 


The Beginnings of Mental Tests 


Interest in psychological testing developed in Germany and 
France about the middle of the last century. This interest grew 
out of the acute need for a better understanding of feebleminded- 
ness and the various forms of insanity. Tests were devised for 
the purpose of determining what the feeble-minded person can 
learn, how much he can learn, and in what respects he differs
most drastically from the normal. In the case of the insane and 
the mentally deteriorated, brief tests were drawn up for assessing 
loss of memory, distortions of perception, distractibility, mental 
fatigue, and changes in such sensory-motor functions as speed 
and accuracy of motor responses. 

In England, interest in mental testing arose from the study of 
individual differences in mental and physical functions. The 
leader in this movement was Sir Francis Galton, an eminent
geneticist, who set up a testing laboratory in London in 1882.
Here, for a small fee, a person could have the keenness of his 
vision and hearing tested, as well as his muscular strength and 
his speed and co-ordination of response. Galton’s tests were quite 
brief and sampled rather narrow aspects of behavior. In fact,
they were sensory-motor rather than strictly mental in char- 
acter. One of the first American psychologists to become inter- 
ested in mental testing was James McKeen Cattell. Cattell in- 
troduced mental tests of the Galton type in this country at the 
turn of the century. 


Intelligence Tests: Individual 

The individual intelligence test as we know it today grew out 
of the work of Alfred Binet, a French psychologist, who was 
director of the laboratory for physiological psychology at the 
Sorbonne. In 1904 Binet was asked to devise a mental test suit- 
able for use in detecting slow learners in the schools of Paris. 
The test was to be used not only to sift out the subnormal chil- 
dren in the grades but also to provide a better understanding of 
degrees of feeblemindedness, with a view to improving the 
education of these children. In 1905, Binet, with a collaborator,
Theophile Simon, brought out the first scale for measuring intelligence.
This scale consisted of thirty problems and questions
arranged in order from easy to hard. A second edition of Binet’s 
Scale appeared in 1908, and a third and final edition in 1911. 
These tests differed sharply from those of Galton. Binet was 
interested in determining the intellectual level of school children, 
not (as was Galton) in studying differences among individuals 
in fairly narrow mental and sensory-motor functions. In order 
to measure intelligence, Binet believed he must get tests which 
would measure a child’s memory, his comprehension and judg- 
ment, and his insight. He avoided questions which demanded 
specific and routine school learning. For example, instead of 
asking the examinee the product of 6 x 3 or the name of the 
largest city in France, Binet asked the child to repeat four digits
(single numbers) or the words of a sentence (heard only once);
to tell the “thing to do” in specific problem situations; to 
criticize (“see through”) an absurd statement or fallacy; to give 
differences between, for instance, a president and a king; to 
define abstract words like justice and loyalty. 

Binet’s famous tests became the basis for the widely used 
Stanford Revision of the Binet-Simon Scale, described in Chapter 
3. To Binet belongs the credit for having set up the first “age
scale,” that is, a test series in which items are arranged or
grouped by age levels. A child’s “score” on an age scale is deter- 
mined by the level attained and is expressed by a mental age 
(MA), which denotes the child's maturity. 

Children of preschool age are unable to do tests which require 
reading and word knowledge. For these children, therefore, as 
well as for children handicapped in speech, vision, or hearing, 
and for the non-English speaking, performance tests must be 
used. In a typical performance test, the child is asked to identify 
common objects, string beads, build towers of blocks; or he may 
be asked to fit blocks into cutouts, arrange pictures in sequence, 
match the colors of cubes. Performance tests have been devised
for use with illiterate and less intelligent adults as well as with
children.


Intelligence Tests: Group 

When intelligence tests are administered to large groups of 
examinees at the same time, they are appropriately called “group 
tests.” The first group tests were developed (in 1917) during
World War I. Together with other information, these tests 
were used (1) in accepting or rejecting men, (2) in the classifica- 
tion of those accepted, (3) in the assignment of draftees to 
various types of service, and (4) in determining admission of 
candidates to officer training schools. There were two kinds of 
group test, called Army Alpha and Army Beta. The first was
intended for soldiers who could read and write; it required that
an examinee follow fairly involved directions, solve “mental
arithmetic” problems, know the meanings of words, and perceive
relations (for example, in an analogies test the question might
be as follows: Hand is to foot as glove is to ____). Army Beta
was a non-language or non-verbal test. It made use only of 
diagrams, pictures, and numbers and was answered by a simple 
system of marking. Army Beta was administered to the illiterate 
and the foreign-born. Directions were given in pantomime for 
the benefit of those soldiers who did not understand English. 

During World War II, a group intelligence test called the 
Army General Classification Test (AGCT) was administered to 
some 12,000,000 men. AGCT is a verbal or language test. It 
includes three sorts of materials: verbal (vocabulary), numerical 
(arithmetic problems), and spatial (for example, problems in 
spatial relations presented by pictures of block piles to be 
“counted” by the examinee). No specific “school” questions 
were asked since the test was designed to measure mental alert- 
ness in dealing with symbolic materials apart from specific train- 
ing. Both Alpha and AGCT are still used in the testing of adults. 

Between World Wars I and II, scores of group tests of intelligence
were constructed and used widely in the schools and colleges.
These and other mental tests (aptitude, personality) have
been widely employed in business and industry as an aid in the 
selection and placement of personnel. 

In most group intelligence examinations, items are answered 
by marking one of several possible solutions (multiple-choice), 
by selecting one of two answers (true-false), and by checking or 
underlining the appropriate reply among several options. These 
answer techniques are called “objective” (p. 185), because in 
scoring such tests the judgment of the examiner does not enter
in, or does so to a very slight degree. Group tests of intelligence
are treated in Chapter 4.


Educational Achievement Tests 


Since World War I, a number of tests of educational achieve- 
ment have been constructed on objective principles. These tests 
are used to determine general educational level or standing, as
well as knowledge of a given subject field, for example,
geometry or French. The general survey test, when used in the 
elementary school, is a comprehensive examination of the stu- 
dent’s knowledge of reading, spelling, arithmetic, grammar and 
literature, history and elementary science. Tests in separate 
subjects—history or physics, for example—are also available at 
educational levels from the secondary school to college. Educa- 
tional achievement tests are called diagnostic when they are used 
to reveal a student’s weaknesses in a particular area such as 
arithmetic or reading. Diagnostic tests must of necessity cover a 
wide range of information and skills in a given subject. Educa- 
tional achievement tests are described in Chapter 5. 


Aptitude Tests 


Tests designed to discover whether a student is “gifted” in 
music or mathematics, say, or whether a young man has the 
knack for dealing with tools and mechanical contrivances are 
called aptitude tests. Aptitude may be inferred (1) from the 
degree of mastery attained in a “new” subject after a period of 
study. Aptitude for a foreign language, for instance, is demon- 
strated in the ease with which the subject (Spanish, for example) 
is acquired after a term’s work. Achievement tests, given after a 
period of “exposure,” reveal this aptitude directly. Aptitude is 
also inferred before a period of study by testing (2) to see 
whether an examinee possesses those abilities and skills judged 
to make for success in a given subject (for example, physics), 
or in a profession (for example, medicine or law). Aptitude for 
physics is gauged by finding how well the student has learned 
the mathematics necessary for work in physics; aptitude for law 
is judged by the student’s ability to read difficult prose, compre- 
hend fairly involved legal arguments and follow a line of 
reasoning to a conclusion. What are called “differential aptitude 
tests” are designed to assess a student’s strengths and weaknesses
in certain fundamental abilities believed to be crucial in a 
number of activities—in and out of school. 

Tests of general mechanical aptitude sample performance in a
number of activities believed to demonstrate mechanical knowledge
and skill. Factors measured by these tests include familiarity
with tools, insight into mechanical relations (pulleys, levers, and 
the like), ability to solve problems expressed in diagrams of 
machines and mechanical contrivances, and interest in mechan- 
ical things, as shown by the reading of popular science, building 
radios, tinkering with cars and so on. Manipulative tasks and
mechanical gadgets have been employed to test for special
abilities in a variety of situations. Among the traits studied are
manual dexterity, sensory-motor skills, visual and auditory
acuity, all of which are needed in many jobs in industry and in
the armed forces.

Clerical aptitude tests cover the knowledge and skills needed
in a business office. Tests under this head provide scores from
which we can predict an examinee’s ability to carry out the
written work of an office: to spell, check records, read and
write easily and accurately.

Aptitude tests of a special sort have been devised for inferring
talent in art and music. In music, for instance, many of the
factors needed for success can be measured: “ear” for music,
rapid and accurate reading of music at sight, and knowledge of
harmony and other technical phases of music. In art, “taste” for
color, form, symmetry and other artistic dimensions are determined
by comparing a student’s judgments with those of
acknowledged experts. Knowing whether a person possesses
talent in art or music is often highly important in educational
and vocational guidance.

Aptitude tests are treated in Chapter 6.


Personality Tests 


Psychologists have used the questionnaire or inventory to determine
personality factors in three areas: (a) personal adjustment,
(b) attitudes, and (c) interests. In addition, questionnaires
have been used in the social sciences to survey socio-economic,
home, and community phenomena. “Tests” of personality are
in reality standard interviews designed to reveal characteristic
ways of behaving. The personal adjustment questionnaire or 
personal data sheet inquires into a person’s fears, worries, 
anxieties, and home and work adjustments. Such inventories are 
often appropriately called “trouble sheets.” In some cases, the 
questions are direct and undisguised: “Are you afraid of high 
places?” “Can you stand the sight of blood?” “Do your parents 
treat you right?” In other adjustment inventories, questions are 
disguised and indirect, so that the intent of the question may not 
be understood by the examinee. A technique often used in such 
inventories is that of “forced choices” (p. 168). 

Attitude questionnaires attempt to reveal systematic ways of 
behaving or thinking about social, religious, or political matters. 
Can a student be classified as narrow- or broad-minded, religious
or irreligious, or somewhere between these extremes? Attitude
inventories try to answer these questions.

Interest inventories survey a person’s interests in books, sports,
people, occupations, social activities, and the like. An examinee’s
pattern of interests may serve to identify him with some well-defined
occupational group, for example, lawyers or chemists.
Or a young man’s interests may identify him with some area of
interests, such as science, business, or social service. Interest tests
are especially valuable in counseling, since interest, as much as
ability, may determine a student’s educational or vocational
choices.

Another group of personality tests makes use of what has been
called “projective” techniques. Projective tests are disguised
interviews in which an examinee is asked what he “sees” in some
neutral situation, an ink blot or a picture, for example. These
tests are perhaps most useful in the diagnosis of disturbed mental
states. They must be administered by an expert and are employed
mostly by psychiatrists and clinical psychologists in severe behavior
problems.

The techniques of the personality questionnaire have been
widely used in polls conducted to assess public opinion about
such things as political issues and social questions. Inventories
have been employed, too, to survey systematically the association 
between items in a constellation of attitudes or opinions—for 
instance, between social and economic background factors, 
preferences for political candidates, etc., and views about social 
and economic issues. In sociological studies in which environ- 
mental factors loom large, the kind of home from which a child 
comes, the educational and occupational status of the parents, 
and the character of the community may be revealed by a
systematic survey of background variables. Personality tests are
treated in Chapter 7.


How Mental Tests Are Used in the Schools 


As we have said, the primary function of the mental test is to
reveal individual differences. More specifically, mental tests are
useful to the teacher in three ways. First, mental tests aid in 
the evaluation of class performance in relation to established 
norms (p. 115). Second, tests reveal the strengths and weaknesses 
of individual pupils, that is, are useful in educational diagnosis 
(p. 116). Finally, tests enable the teacher to discover whether 
a pupil possesses aptitude for a given subject or course of study, 
and to predict his probable success in college or professional
school. We shall consider these three objectives in the chapters
to follow.


SUGGESTIONS FOR FURTHER READING 


Comprehensive accounts of the development of mental testing and of
the application of tests in various areas will be found in the references
below.

Anastasi, A. Psychological Testing. New York: Macmillan, 1954.

Freeman, F. S. Theory and Practice of Psychological Testing. (Rev.
ed.) New York: Holt, 1955.

Ross, C. C., and Stanley, J. C. Measurement in Today’s Schools. (3rd
ed.) New York: Prentice-Hall, 1954.

Thorndike, R. L., and Hagen, Elizabeth. Measurement and Evaluation
in Psychology and Education. New York: Wiley, 1955.


CHAPTER 2 


STATISTICS IN MENTAL TESTING 


The purpose of this chapter is to acquaint the prospective user 
of mental tests with those statistical terms and techniques most 
often used in testing. Stress throughout the chapter is on the 
meaning and significance of symbols and terms rather than on
the mechanics of computation. For the latter, the student should
consult the Appendix as well as the books on statistical method
listed at the end of this chapter.

Perhaps the best advice one can offer the teacher who is planning
to use mental tests is that he first take a course in statistics.
For students who have been wise enough to do so, the present 
treatment will constitute simply a brief review and summary. 
And for those who have had no statistical training, it will provide
the minimum essentials for the understanding and evaluation
of mental tests themselves.


THE FREQUENCY DISTRIBUTION 


Drawing Up a Frequency Distribution 


Suppose that a teacher has administered a test of English 
grammar to fifty children in the seventh grade. The papers have 
been marked and the names and scores of the children recorded. 
Two questions ordinarily arise: (1) What is the typical per- 
formance of the class, and (2) What is the range of talent in 
the class? To answer these questions, we may organize and pre- 
sent the fifty scores in one of several ways. 

Table 2-1 is a systematic tabulation of the fifty English 
grammar scores into what is called a frequency distribution. 

The fifty scores have been arranged from high to low into 
sets of five under the heading “Scores.” In the frequency column 
headed “f” are listed the numbers of scores which fall into each 
sub-group. For example, five children score in the interval 60-64, 
eight in the interval 55-59, and so on down to four who score 
in the bottom interval, 30-34. 


TABLE 2-1


Frequency Distribution of Fifty Scores
on a Test of English Grammar


Scores       f
60 - 64      5
55 - 59      8
50 - 54     10
45 - 49     12
40 - 44      6
35 - 39      5
30 - 34      4


A test score is always taken to represent the distance along
some scale of ability running from low to high. Thus, a score
of 46 covers the span from 45.5 to 46.5, 46.0 itself being the
middle of the score interval. Other scores have the same meaning:
in each case the score covers the distance .5 unit below to .5
unit above the face value of the given score. This definition of a
score means, of course, that the interval 30-34 begins at 29.5
and ends at 34.5, that interval 35-39 begins at 34.5 and ends at
39.5, and so on. For convenience in writing, the intervals in
Table 2-1 are the score limits rather than the exact limits. In each
case, however, the exact limits of the intervals are understood.
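The relation between score limits, exact limits, and midpoints can be sketched in a few lines of Python. This is only an illustration: the frequencies are those listed for these fifty scores in Table 2-2, and the function names are invented here.

```python
# Intervals for the fifty grammar scores: (lower score limit, upper score limit, f).
table_2_1 = [
    (60, 64, 5), (55, 59, 8), (50, 54, 10), (45, 49, 12),
    (40, 44, 6), (35, 39, 5), (30, 34, 4),
]

def exact_limits(low, high):
    """Exact limits lie half a unit outside the score limits."""
    return low - 0.5, high + 0.5

def midpoint(low, high):
    """Midpoint of the interval, halfway between its exact limits."""
    lower, upper = exact_limits(low, high)
    return (lower + upper) / 2

for low, high, f in table_2_1:
    lower, upper = exact_limits(low, high)
    print(f"{low}-{high}: exact limits {lower} to {upper}, "
          f"midpoint {midpoint(low, high)}, f = {f}")

total = sum(f for _, _, f in table_2_1)  # N, the number of pupils
```

The midpoints found this way (32, 37, 42, and so on) are the points above which frequencies are plotted in a frequency polygon.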


Graphic Representation of the Frequency Distribution


A frequency distribution may be represented graphically by
a frequency polygon, as shown in Figure 2-1. In the construction
of a frequency polygon, scores are laid off along the baseline, or
X-axis, at equal intervals, and the frequencies (f’s) are plotted
on the vertical or Y-axis. Each f is plotted directly above the
midpoint of the interval upon which it falls. The four scores
falling in the first grouping, 30-34, are plotted above 32, the
midpoint of the interval. In the other intervals (reading up), 5
scores are plotted above 37, midpoint of 35-39, 6 above 42,
12 above 47, and so on. The points are joined with short straight
lines to give the outline of the frequency polygon.


FIGURE 2-1 Frequency Polygon of Fifty Scores Achieved by
Seventh-Grade Children on a Test of English Grammar
(X-axis: Scores; Y-axis: Frequencies)


A frequency polygon shows graphically how the scores are 
spread over the test scale from low to high. From Figure 2-1 it 
is apparent that more children scored in the middle of the scale 
(see, for example, the 12 on interval 45-49) than at either 
extreme. Rules for constructing a frequency polygon so as to 
provide a good picture of the test data will be found in the 
Appendix. 

Another way of representing a frequency distribution graphically
is the histogram. Figure 2-2 represents the f’s on the score
intervals by small rectangles set up over each interval. For the
first interval, the rectangle is four Y-units high, and for the
second interval five Y-units high, and so on. The highest rectangle,
12 units on the Y-axis, is above interval 45-49.


FIGURE 2-2 Histogram of Fifty Scores Achieved by Seventh-
Grade Pupils on a Test of English Grammar
(X-axis: Scores)


The histogram and frequency polygon represent the same 
facts, and there is little to choose between them. Frequency 
polygons are to be preferred to histograms when two distributions
are plotted on the same axes, since in the histogram the
vertical and horizontal lines often coincide, making the figures
difficult to disentangle.


The Normal Curve 


The symmetrical bell-shaped graph shown in Figure 2-3 is the
well-known normal curve. This “ideal” frequency polygon is
the mathematical model to which many distributions of actual
scores approximate. (See, for example, Figure 2-1.) The normal
curve is often called the normal probability curve because it
shows the probability of occurrence of scores of different size,
when these are determined by a large number of independent
and randomly combined factors.


FIGURE 2-3 The Normal Curve


The normal curve has played an important role in the development
of mental measurement. Among its uses in testing may be
mentioned the following:

1. Selecting the Items of a Test. When the distribution of test
scores for a class is badly off-center or “skewed,” as shown in
Figures 2-4 and 2-5, the test is not suitable for the group. In
Figure 2-4 the test is too easy (there are too many high scores),
and in Figure 2-5 the test is too hard (there is a disproportionate
number of low scores). When the test maker takes the normal
curve as his model, questions and problems are carefully selected
and their scoring adjusted to give a symmetrical arrangement of
test scores like that of the normal curve. This means that a
majority of pupils score at the middle of the scale, a smaller
number scoring at the high and low ends. Note that, according
to the criterion of normality, the frequency polygon in Figure
2-1 shows the English grammar test to be generally satisfactory,
though perhaps a bit too easy.


FIGURE 2-4 Negatively Skewed Curve


FIGURE 2-5 Positively Skewed Curve


2. Scaling the Obtained Scores from a Test. Raw or obtained 
scores from a test are usually expressed by an arbitrary number 
of points. Scores of this sort do not represent equal steps or equal 
units along some ability scale; and since there is no zero point, 
a score of 40 is not twice as good as a score of 20. When point 
scores are transformed into deviations from the average or mean, 
and expressed in units of the standard deviation (page 36) of 
the group, they are called sigma-scores. The unit of deviation 
(the standard deviation) is usually represented by the Greek 
letter σ (sigma). Sigma-scores may later be converted into
standard scores (page 38). Many educational achievement and 
aptitude tests publish norms (page 40) in terms of standard 
scores. These scores are comparable from test to test when dis- 
tributions are normal, or approximately so. 

Point scores may be changed over directly into equal-unit
scores in a normal distribution. Such “normalized” scores have
several advantages (page 40).


3. Determining the Stability of a Test Score. An obtained score
on a test (for example, a group test IQ) can be expected to
vary somewhat up or down when the test is administered a
second time. The variation to be expected in a score, that is, its
probable stability, can be predicted from tables of the normal
probability curve (page 23).


AVERAGES


After a frequency distribution has been tabulated, we are
ready to compute a typical measure or average. There are three
sorts of averages (also called measures of central tendency) in
common use.


The Mean (M) 


Given a set of ten scores, 10, 9, 10, 12, 8, 6, 4, 7, 5, and 4, the
mean is simply 7.5, found by adding the scores (75) and dividing
this sum by their number (10). The M is popularly called the
average. When scores have been grouped into a frequency distribution,
as shown in Table 2-1 on page 14, a slightly different
method is employed in finding the M. (See Appendix.) But the
M is always essentially the sum of the scores divided by their
number.
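As a quick check of the arithmetic, the mean of the ten scores can be computed directly. This is a minimal sketch, and the helper name `mean` is chosen here only for illustration.

```python
scores = [10, 9, 10, 12, 8, 6, 4, 7, 5, 4]

def mean(values):
    """Sum of the scores divided by their number."""
    return sum(values) / len(values)

total = sum(scores)   # the sum of the ten scores
m = mean(scores)      # the mean, M
print(total, m)       # prints: 75 7.5
```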


The Median (Mdn) 


When scores are arranged in order of size, another sort of 
average, the median (Mdn) is the point in the distribution found 
by counting off one-half of the scores from either end of the 
series. We usually start with the low end. For example, for the 
five scores, 7, 8, 9, 10, and 12, the median or mid-score is 9: there 
are two scores above and two below it. When the number of 
scores is even—for example, 5, 7, 8, 9, 10, and 12—the median 
is midway between the two middlemost scores, namely, at 8.5. 
There is no mid-score. When scores are grouped into a frequency
distribution, as shown in Table 2-1 on page 14, the median is [...]
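The two ungrouped examples above can be verified with Python's standard `statistics.median`, which takes the middle score of an odd-sized set and the point midway between the two middlemost scores of an even-sized set. The variable names are illustrative only.

```python
from statistics import median

odd_set = [7, 8, 9, 10, 12]        # five scores: the mid-score is the median
even_set = [5, 7, 8, 9, 10, 12]    # six scores: there is no mid-score

mdn_odd = median(odd_set)
mdn_even = median(even_set)
print(mdn_odd, mdn_even)           # prints: 9 8.5
```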


The Mode


That score in a set of scores which occurs most frequently is
called the crude mode, or the mode [...] largest frequency.
[...] but usually we simply [...] the crude mode without [...]
measure of central tendency. For exploratory purposes it does
not need to be computed so precisely as the mean or median.


MEASURES OF VARIABILITY 


The Range 


It is sometimes more important to know the variability of a
set of scores than to know the mean or median. Suppose, for
example, that two sections of Grade 7 have the same mean but
differ markedly in spread of talent, as evidenced in the variability
of scores around the mean. Figure 2-6 shows two distributions
of this sort: the scores in A range from 40 to 60, whereas the
scores in B range from 20 to 80. The difference between the high
and low scores in the A distribution is 20 points; in the B distribution
60 points. The range is the most general index of variability.
Other more exact measures are the standard deviation (written
as σ) and the quartile deviation (written as Q).


FIGURE 2-6 Two Distributions with the Same Mean but
Differing Markedly in Range (Variability)
(Score scale: 20, 40, M, 60, 80)



The Standard Deviation (σ)


The mean of the set of five scores (12, 10, 8, 6, and 4) is 8.
If 8 is subtracted from each score, we have 12 − 8 = 4, 10 − 8 =
2, 8 − 8 = 0, 6 − 8 = −2, and 4 − 8 = −4. The size of a
deviation from M tells the extent to which the individual score
deviates from the common mean; and the sign of the deviation
indicates its direction from M. If each deviation is now squared,
we have 4² = 16, 2² = 4, (−2)² = 4, and (−4)² = 16. The
square of 0 is, of course, 0. The sum of these squared deviations
is 40, and σ, the standard deviation, is defined as

$$\sigma = \sqrt{\frac{\sum (\text{deviations})^2}{N}}$$

or, in our example, $\sigma = \sqrt{40/5} = \sqrt{8} = 2.83$.*


Squaring the separate deviations around the M eliminates the
minus signs and gives extra weight to extreme deviations. A SD
or σ is judged to be large or small (to reflect much or little variation)
in relation to other SD’s computed for the same test. For
example, if 35 boys and 42 girls have the same M on a history test
but the boys’ σ is 10 and the girls’ σ is 6, we know that the boys’
scores spread more than the girls’ up and down the scale, in
both directions from the mean.
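The five-score example can be worked through in code. The sketch below follows the definition used in this chapter (divisor N, not N − 1); the function name is an illustrative choice.

```python
import math

scores = [12, 10, 8, 6, 4]

def standard_deviation(values):
    """Square root of the mean squared deviation from M (divisor N)."""
    m = sum(values) / len(values)
    squared = [(x - m) ** 2 for x in values]
    return math.sqrt(sum(squared) / len(values))

m = sum(scores) / len(scores)                 # the mean, 8
sum_sq = sum((x - m) ** 2 for x in scores)    # sum of squared deviations, 40
sigma = standard_deviation(scores)
print(m, sum_sq, round(sigma, 2))             # prints: 8.0 40.0 2.83
```

The standard library function `statistics.pstdev` uses the same divisor N and should give the same value.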

In a normal curve, σ provides valuable information concerning
the way in which the separate measures fall around the
common mean. In Figure 2-7, for example, 3σ is seen to include
virtually all the measures above the M, and −3σ all of the measures
below the M. The total area of the normal curve is taken
as N. From tables of the area of the normal curve, we know that
between M and 1σ are approximately 34 per cent of the measures
(actually 34.13 per cent); and between M and −1σ are also
34 per cent of the measures. The two “halves” of the curve are
equal. Hence we find about 68 per cent of the measures, roughly
two thirds, between M and ±1σ. Furthermore, from tables we
find that 14 per cent of the measures fall between 1σ and 2σ in
the normal curve and about 2 per cent between 2σ and 3σ. The
same proportions hold, of course, for the half of the curve to
the left of the M, since the M divides the area of the normal
curve into two equal parts.


* For calculation of σ from a frequency distribution, see Appendix.


FIGURE 2-7 Areas Under the Normal Curve


The relations of σ to the total area (N) in the normal curve
model hold approximately for distributions which resemble the
normal curve in form. An illustration will make clear how the
normal curve model is used in such cases. Suppose that on a
reading test administered to sixty children in the fifth grade,
the M = 62 and σ = 8; and suppose further that the frequency
polygon of these scores closely resembles the normal curve in
form. Taking the normal curve as our model, then, we can say
that approximately two thirds of the scores (that is, forty) fall
between 54 and 70 (62 ± 8). Moreover, about 14 per cent of
the scores, or about 8, will fall between 70 and 78 (between
1σ and 2σ), and about 2 per cent, or 1 or 2, will fall between
78 and 86 (that is, between 2σ and 3σ). In the lower half of
the distribution, 14 per cent, or 8 scores, will fall between 54
and 46, and 2 per cent between 46 and 38. These relationships
are shown in Figure 2-8. Note that M, the reference point,
is 62 and that σ is 8 units on the test scale.


FIGURE 2-8 Use of Normal Curve Model to Show Distribution
of Sixty Scores on a Reading Test
(Areas marked, from low to high: 2%, 14%, 34%, 34%, 14%, 2%)
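The percentages quoted from tables of the normal curve can be reproduced from the error function in Python's `math` module. The sketch below applies them to the reading-test example (M = 62, σ = 8, N = 60); the function and constant names are assumptions made for illustration.

```python
import math

def normal_cdf(z):
    """Proportion of the normal curve lying below z (z in sigma units)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_between(z_low, z_high):
    """Proportion of the area between two points on the sigma scale."""
    return normal_cdf(z_high) - normal_cdf(z_low)

M, SIGMA, N = 62, 8, 60   # reading-test example from the text

for z_low, z_high in [(0, 1), (1, 2), (2, 3)]:
    share = area_between(z_low, z_high)
    low, high = M + z_low * SIGMA, M + z_high * SIGMA
    print(f"{low} to {high}: {share:.4f} of the area, "
          f"about {share * N:.1f} of {N} scores")
```

The three areas come out close to the rounded 34, 14, and 2 per cent used above, and the same values hold for the corresponding bands below the mean.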


The Quartile Deviation (Q) 


Just as we compute the median by counting off 50 per cent
of the scores, so we can count off 25 per cent of the scores
from the low end of the distribution (that is, 25 per cent of
N) to locate Q1, the first quartile point. Similarly, we can count
off 75 per cent of the scores from the low end of the distribution
to locate Q3, the third quartile. The gap between Q3 and Q1
is called the interquartile range, or range of the middle 50 per
cent. Q, the semi-interquartile range, is computed thus:

$$Q = \frac{Q_3 - Q_1}{2}$$

Like σ, Q is a measure of variability but, unlike σ, it is found
by counting into the distribution, whereas σ is computed from
the squared deviations taken around the M. When the median
is the measure of central tendency, we generally use Q; when
M is the measure of central tendency, we use σ.

Methods of computing Q from a frequency distribution will
be found in the Appendix; here we are concerned primarily
with the meaning of Q as a measure of variability. Q’s usefulness
will become clearer when we have computed the percentile
curve, or ogive, as shown in the next section.


PERCENTILES AND PERCENTILE RANK 


Table 2-2 shows the frequency distribution of Table 2-1, 


with the addition of two columns in which the f’s have been 
cumulated. 


TABLE 2-2


Frequency Distribution and Cumulated Frequencies of
Fifty Scores on an English Grammar Test
Data are the fifty scores in Table 2-1.


  (1)       (2)     (3)       (4)
Scores       f     cum. f   % cum. f
60 - 64      5       50       100
55 - 59      8       45        90
50 - 54     10       37        74
45 - 49     12       27        54
40 - 44      6       15        30
35 - 39      5        9        18
30 - 34      4        4         8


FIGURE 2-9 Ogive or Cumulative Frequency Curve
(X-axis: score scale; Y-axis: % cum. f)


In column (3), scores have been added progressively (cumulated)
from the bottom to the top of the distribution. On the
first interval, 4 is the entry; 4 + 5 on the next interval gives 9;
9 + 6 on the third interval gives 15; and so on. In column (4),
these cumulated scores are expressed as percentages of N. In
Figure 2-9, cumulated f’s in percentages have been plotted
against the score-intervals laid off along the baseline. As scores
are added over each interval, each % cum. f is plotted just above
the upper limit of the interval upon which it falls. The resulting
S-shaped curve is called an ogive, or cumulative frequency
graph. The ogive constricts or expands the scale of scores into
a scale of one hundred points, called a percentile scale. The
median and the Q’s can be read from the ogive almost as accurately
as they can be computed from a frequency distribution.
To illustrate, if a line is run from the 50 per cent point on the 
Y-scale across to the curve, a perpendicular dropped from this 
point to the score-scale locates Mdn at 49 approximately. (The 
computed value is 48.66.) The twenty-fifth percentile, or Q1,
is located from the ogive at 42 approximately; and the seventy- 
fifth percentile, or Q3, at about 55. Other percentile points (for 
example, Pas or Pes) can be located in the same manner by going 
from the appropriate point on the vertical percentage scale 


across to the ogive and dropping a perpendicular to the baseline.
Note that the distance from Q3 to Q1 (that is, 55 − 42) is
the interquartile range or range of the “middle 50.” One-half
of this distance is 13/2, or 6.5, which is the quartile deviation,
or Q. The larger the Q of a distribution, the greater the spread 
of the middle 50 per cent of scores along the scale and the larger 
the variability. 
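Reading a percentile point from the ogive amounts to linear interpolation within the interval that contains it. The sketch below does this arithmetically for the data of Table 2-2. The interpolation rule and the function name are assumptions made for illustration, not a procedure quoted from the Appendix.

```python
# Table 2-2, low to high: (lower score limit, upper score limit, frequency).
intervals = [
    (30, 34, 4), (35, 39, 5), (40, 44, 6), (45, 49, 12),
    (50, 54, 10), (55, 59, 8), (60, 64, 5),
]

def percentile_point(p, table):
    """Score below which p per cent of the cases fall (linear interpolation)."""
    n = sum(f for _, _, f in table)
    target = p / 100 * n              # number of cases to count off
    cum_below = 0
    for low, high, f in table:
        if cum_below + f >= target:
            lower_exact = low - 0.5
            width = (high + 0.5) - lower_exact
            return lower_exact + (target - cum_below) / f * width
        cum_below += f
    raise ValueError("p must lie between 0 and 100")

q1 = percentile_point(25, intervals)
mdn = percentile_point(50, intervals)
q3 = percentile_point(75, intervals)
q = (q3 - q1) / 2                     # quartile deviation
print(round(q1, 2), round(mdn, 2), round(q3, 2), round(q, 2))
```

The interpolated median is about 48.67, in line with the computed value of 48.66 quoted above, and Q1 and Q3 fall near the graphical readings of 42 and 55. The interpolated Q is therefore a little smaller than the 6.5 read from the ogive.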

A pupil's percentile rank (PR) is the position on the per- 
centile scale (on a scale of one hundred points) to which his 
score entitles him. Suppose that Tom Brown achieves a score 
of 40 on our English grammar test. What is his PR? Going out 
to 40 on the score-scale on the baseline, up to the curve, and then 
across to the Y-scale, we locate Tom's PR at about 20. This 
PR tells us at once that about 20 per cent of the pupils scored 
lower than Tom. If Mary Green scores 58 on the grammar test, 
her PR is read at approximately 84, and 84 per cent of the
class made lower scores than she did. Scores achieved on tests
expressed in different units (for example, a reading test and an
arithmetic test) cannot be compared directly. But relative positions
(PR’s) of a child in his classes can be quickly determined
and compared when both sets of scores have been converted
into a common percentile scale. Moreover, several PR’s may be
combined to give a general index.
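A percentile rank can also be computed instead of read from the graph. The sketch below counts the cases lying below a score in the Table 2-2 distribution, interpolating within the interval that contains the score. The method and names are illustrative assumptions.

```python
# Table 2-2, low to high: (lower score limit, upper score limit, frequency).
intervals = [
    (30, 34, 4), (35, 39, 5), (40, 44, 6), (45, 49, 12),
    (50, 54, 10), (55, 59, 8), (60, 64, 5),
]

def percentile_rank(score, table):
    """Per cent of cases falling below `score`, interpolating within its interval."""
    n = sum(f for _, _, f in table)
    below = 0.0
    for low, high, f in table:
        lower_exact, upper_exact = low - 0.5, high + 0.5
        if score >= upper_exact:
            below += f    # the whole interval lies below the score
        elif score > lower_exact:
            below += f * (score - lower_exact) / (upper_exact - lower_exact)
    return 100 * below / n

tom = percentile_rank(40, intervals)
mary = percentile_rank(58, intervals)
print(round(tom, 1), round(mary, 1))
```

The computed ranks, roughly 19 for Tom and 85 for Mary, agree closely with the values of about 20 and 84 read from the ogive.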


CORRELATION* 


The relationship between two sets of test scores can be described
mathematically by the coefficient of correlation between
them. Correlation is expressed by a decimal fraction (called r),
which may vary along a scale from .00 to ±1.00. Let us suppose
that tests in English grammar and in history have been administered
to the same seventh-grade class. Suppose further that
children who score high in the English test tend to score high
in history, and that children scoring fairly high or quite low
in English tend to score fairly high or quite low in history. When
this happens, the coefficient of correlation between the two
sets of scores will be marked or substantial, for example, .60
to .70. Now suppose that most pupils who score high in English
grammar score only average in arithmetic. The correlation between
these two areas would then be lower, perhaps no more
than from .20 to .30. If those pupils who score high in English
grammar tend to earn very low scores on a test in shop work,
the correlation here would be close to zero, or perhaps negative.


* See Appendix for computation of a correlation coefficient.


Positive coefficients of correlation run from .00 to +1.00;
good scores in the one test go with good scores in the other.
Negative coefficients of correlation run from .00 to −1.00 and
denote an inverse relationship: good scores in the first test go
with poor scores in the second. Zero correlation denotes simply
the absence of any relationship between two variables.

Whether a correlation coefficient is to be regarded as high or
low depends upon a number of factors. The correlation of
height with weight in school children is generally high, around
.70 for a given age level. The correlation of a good intelligence
test and school grades will fall typically between .50 and .70;
and the correlations between personality traits (from questionnaires)
and school achievement are usually low and often negative.
The following table will aid in interpreting coefficients of
correlation:


r’s from  .00 to ±.20     very low; negligible
r’s from ±.20 to ±.40     low; present but slight
r’s from ±.40 to ±.70     substantial or marked
r’s from ±.70 to ±1.00    high to very high


When computing the correlation between two forms of the
same test (the self-correlation of the test), we demand much
higher r’s than are found typically between different variables.

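For readers who want to see the coefficient computed, here is a minimal sketch of the usual product-moment r together with the verbal labels from the table above. The pupil scores are invented purely for illustration. Placing each boundary value (.20, .40, .70) in the higher category is an assumption, since the table's ranges share their end points.

```python
import math

def pearson_r(x, y):
    """Product-moment coefficient of correlation between paired scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def describe(r):
    """Verbal label from the interpretive table; the sign of r is ignored."""
    size = abs(r)
    if size < 0.20:
        return "very low; negligible"
    if size < 0.40:
        return "low; present but slight"
    if size < 0.70:
        return "substantial or marked"
    return "high to very high"

# Hypothetical scores for eight pupils, invented for illustration only.
english = [62, 58, 55, 51, 47, 44, 40, 33]
history = [70, 61, 66, 52, 58, 45, 49, 41]

r = pearson_r(english, history)
print(round(r, 2), describe(r))
```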
The Reliability of a Test 


The Reliability Coefficient. One important application of cor- 
relation to mental testing is in the determination of the reliability 
of a test. Test reliability refers to the stability of test scores.


If a child achieves a score of 48, for example, on a highly reliable 
test of general science, subsequent scores earned by this pupil 
upon equivalent forms of this test should not differ greatly from 
the initial score of 48. But if the test is unreliable, in repeated 
testing the score may vary widely from its first determination. 

The reliability coefficient of a test is found by computing the 
self-correlation of the test. Suppose that a reading examination 
has been given to five sixth-grade classes and that two weeks 
later the same test or an equivalent form is administered to the 
same classes. If the correlation between these two administra- 
tions of the test is high (a reliability coefficient of .90 or more 
is considered high), we may feel confident that scores earned
by pupils in this class are reasonably accurate measures of “true” 
ability. 

Test reliability is sometimes determined by repeating a test 
and correlating the second set of scores against the first set. This 
method is followed when there is only one form of the test. 
More often, an equivalent or parallel form of the test is given, 
and the reliability coefficient is the r between the test and its 
alternate form. The reliability coefficients of many standard edu- 
cational tests have been determined in this way. Other ways of 
determining test reliability will be found in the references at the 
end of the chapter. The authors of standard tests will usually
specify what method has been used in computing the reliability
of their tests.


The Standard Error of a Score 


The accuracy or precision of an individual score is perhaps
best expressed by the standard error of a score, which is also
called the standard error of measurement. The SE (standard
error) is calculated from the following formula:

$$SE_{\text{score}} = \sigma \sqrt{1 - r_{11}}$$

where σ is the standard deviation of the test scores and $r_{11}$ is
the reliability coefficient of the test. Suppose that the σ of a
set of test scores is 10 and the reliability coefficient ($r_{11}$) is .95.
Then the SE of a score on this test is $SE = 10\sqrt{1 - .95}$, or 2.2.
This may be interpreted to mean that should a child take this
test a second time, the chances are good (about 7 in 10) that
his “new” score will not diverge by more than ±2 points from
the original determination. The SE of a Stanford-Binet IQ is
4.5 points for IQ’s from 90 to 110. In other words, if the test
is repeated, we can expect a child’s IQ to stay within ±4 to 5
points of its first determination.
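The worked example can be checked in a few lines. This is a minimal sketch of the formula above; the function name and the range check on the reliability coefficient are illustrative additions.

```python
import math

def standard_error_of_score(sigma, reliability):
    """SE of a score: sigma times the square root of (1 - r11)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability coefficient must lie between 0 and 1")
    return sigma * math.sqrt(1.0 - reliability)

se = standard_error_of_score(10, 0.95)
print(round(se, 1))   # prints: 2.2
```

Because the SE depends on both σ and the reliability coefficient, a less reliable test given to a more homogeneous group (smaller σ) can yield a similar SE.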

Reliability coefficients of standard intelligence and educa- 
tional achievement tests are generally above .90 for large groups 
of pupils. The size of the reliability coefficient depends upon 
several factors: the variability of the group, the length of the 
test, the method used in determining reliability. A reliability 
coefficient of .50 in a single grade or class may indicate as much 
stability of score as a reliability coefficient of .90 in a large 
group. The great advantage of the SE of a score is that it takes 
account of both the reliability coefficient and the variability (SD) 
in the group. (See page 56.) 


The Validity of a Test 


A mental test is a valid testing device when it measures what
it claims to measure. Tests are not valid for all areas and all
situations, but are valid in certain defined situations and for
certain behaviors. A group intelligence test, for example, is not
a valid measure of emotional control or of delinquent behavior.
Validity may be classified, for convenience, into three sorts:
experimental, content, and predictive. The validity of an intelligence
test is determined experimentally by computing the test’s
correlation with various criteria: school grades, ratings for mental
alertness, and other measures of intellect, to mention a few.
Many of the best tests of general intelligence have been validated
against the Stanford-Binet, the best known individual
intelligence test (page 47). Aptitude tests (for example, those
of clerical and mechanical aptitudes) are validated against demonstrated
proficiency in office work or in mechanical tasks. Independent
measures against which tests are validated are called
criteria. Criteria do not represent entirely adequate or completely
sufficient determinations of a trait. Insofar as criteria
incorporate valuable aspects of the behavior we are studying,
however, they represent variables with which a test, to be valid,
must correlate positively.

Tests of educational achievement in history, mathematics, 
languages, and the like possess content validity in that test 
questions sample the subject matter areas directly. Content valid- 
ity is not alone a sufficient index of a test’s usefulness. Such 
considerations as choice of items, extent of sampling, form in 
which items are put, and level of difficulty are also very important.
But content validity is a necessary first step. Intelligence
and aptitude tests possess content validity insofar as the items 
in them fulfill the author’s definition of what he is measuring. 
Such asserted or “face” validity, however, is never as convincing 
as is the content validity of the educational achievement test. 
Generally, tests of intelligence and of aptitude must depend for 
their validity upon correlations with independent criteria judged 
to be dependable indices of the trait under study. 

Predictive validity is the degree to which a test battery is related 
to some criterion of future performance or measure of success 
which will become available in the future. The predictive validity 
of a good group intelligence test for school performance ranges 
from about .40 to .60. (See page 96.) Many short tests have low 
correlations with a criterion, but when put with other tests
into a team combine forces to raise the correlation of the battery
with the criterion. Validity coefficients do not run as high as do
reliability coefficients, since no test can correlate higher with 
other tests than with measures of itself. 

Personality questionnaires, interest blanks, and attitude scales
have content validity insofar as choice of items is concerned.
Such instruments are usually validated experimentally against
objective expressions of interest, indices of neurotic behavior,
and the like.


Practical Considerations in the Choice of a Test 


There are a number of factors which enter into the choice 
of a mental test besides validity and reliability. Some of the 
more important are the following: 


1. Appearance: Is the test format good? Are the items attractively
   presented and arranged?

2. Administration: How much time is required to give and
   score the test? What is the cost?

3. Manual: Does the author give full accounts of reliability and
   validity (how found, upon what samples, of what sorts)?
   Are instructions clear?

4. Norms: Are the test norms readily interpreted? Are age and
   grade equivalents given? What type of scaling is used?


SCALING TEST SCORES 


The purpose of scaling is (1) to revamp the raw test scores 
into a scale of equal units, and (2) to enable us to combine 
sub-tests into a single index. It is sometimes important (especially 
with aptitude tests) to compare relative performances, and this 
can be done only when tests are expressed in equal units. The 
score on a test when expressed simply in number of items done 
correctly is an aggregate of arbitrary points. Pupils can be 
ranked in order of merit for such aggregates, but such “scores” 


do not constitute a scale. There are several methods for scaling
raw scores.


The Age Scale 


When scores are put into MA (mental age) units, they form an
age scale. Mental age is the chronological age which corresponds
to or is typical of a given test score. The MA of 9-6, for example,
represents the performance of the average child who is 9 years
and 6 months old. Thus if Dick achieves an MA of 9-6 on the
Stanford-Binet, this mental age is a measure of his intellectual
status or degree of mental growth.

If Dick’s life age (CA) is 10-4, his IQ is 92. IQ = MA/CA,
which in our example is 114 months/124 months, or .92; the
decimal point is dropped.
IQ is a measure of a child’s brightness relative to that of other 
children of his age. When MA and CA are equal, the IQ is 100, 
the brightness index of the average child. Dick’s IQ of 92 means 
that he is somewhat less bright than the typical child of his age 
level. IQ’s above 100 are achieved by bright children—those 
whose mental growth runs ahead of their years. IQ’s below 100 
indicate that a child is below normal, and very low IQ’s (70 or 
below) imply feeblemindedness. 
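Dick's IQ can be recomputed as follows. The sketch converts ages written as years-months into months and multiplies the ratio by 100, which is what dropping the decimal point amounts to. The helper names are illustrative.

```python
def to_months(years, months):
    """Convert an age written as years-months (for example, 9-6) into months."""
    return years * 12 + months

def ratio_iq(ma_months, ca_months):
    """Ratio IQ: MA divided by CA, times 100, rounded to a whole number."""
    return round(100 * ma_months / ca_months)

ma = to_months(9, 6)     # MA of 9-6
ca = to_months(10, 4)    # CA of 10-4
iq = ratio_iq(ma, ca)
print(ma, ca, iq)        # prints: 114 124 92
```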

The age-scale is used in most individual intelligence scales and
by many group tests of general intelligence. The MA and IQ
were first widely used to measure performance on the Stanford-
Binet test, which was constructed so as to meet the requirements
necessary to yield a constant ratio score, or IQ. Many group
tests do not meet these requirements. It is wise, therefore, to
accept IQ’s from group tests as tentative indices of brightness
not always closely related to IQ from the Stanford-Binet.


The Percentile Scale 

We have already seen (page 25) how obtained scores can be
fitted into a scale of one hundred units to yield a percentile
scale. The PR (percentile rank) of a score, its position on the
percentile scale, can be computed from the frequency distribution
of scores. But the simplest plan is to plot an ogive (see Figure
2-9, page 26) and read the PR from the graph. The PR of
any score then becomes the percentage of the distribution which
lies below the score. This method is not accurate beyond the
first decimal, but it is sufficiently precise for many purposes. It
is easy to apply and requires a minimum of calculation. Table 2-3
gives the frequency distribution of 180 scores on a clerical
aptitude test earned by students enrolled in several courses in a
business college.


TABLE 2-3


Frequency Distribution of 180 Scores Achieved on a
Clerical Aptitude Test


                                              PR's of
Scores       Midpoints   cum. f   % cum. f   midpoints
194 - 196       195        180      100         99
191 - 193       192        177       98         95
188 - 190       189        167       93         85
185 - 187       186        140       78         69
182 - 184       183        110       61         50
179 - 181       180         70       39         29
176 - 178       177         35       20         14
173 - 175       174         17        9
170 - 172       171          7        4


The ogive in Figure 2-10 has been constructed from the
% cum. f’s in Table 2-3 following the method outlined on page
26. In the last column are entered the PR’s of the midpoints of the
successive score-intervals. The midpoints reading down are 195,
192, 189, 186, 183, 180, 177, 174, and 171. Any student who
earns a score of 191, 192, or 193 (who falls, that is, in the interval
next to the top) receives a PR of 95, the PR of the midpoint of
this interval. These midpoint PR’s constitute norms for the test.
PR’s can be read with considerable accuracy from the ogive.
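The midpoint PR's can be approximated arithmetically from the cumulated frequencies of Table 2-3. The sketch below assumes that half of the cases within an interval lie below its midpoint. That convention is an assumption made here, whereas the text reads the PR's from the ogive, so the results may differ from the tabled values by a point or so.

```python
# Table 2-3, low to high: (midpoint, frequency cumulated to the top of the interval).
rows = [(171, 7), (174, 17), (177, 35), (180, 70), (183, 110),
        (186, 140), (189, 167), (192, 177), (195, 180)]

def midpoint_percentile_ranks(table):
    """PR of each midpoint: cases below the interval plus half of those within it."""
    n = table[-1][1]                  # total number of cases
    ranks = {}
    cum_below = 0
    for midpoint, cum_f in table:
        f = cum_f - cum_below         # frequency on this interval
        ranks[midpoint] = 100 * (cum_below + f / 2) / n
        cum_below = cum_f
    return ranks

ranks = midpoint_percentile_ranks(rows)
for midpoint, pr in ranks.items():
    print(midpoint, round(pr, 1))
```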

PR’s have several advantages over raw scores. Suppose that
a child has taken tests in arithmetic, science, English, history, and
spelling. If his PR’s in each of these tests are known, they can
be represented comparatively on a profile as shown in Figure
2-11.


This graph permits a comparison of the child’s achievement in
the five subjects. It is clear that he is satisfactory in arithmetic
(PR = 60) and science (PR = 55), above average in history
(PR = 60), average in English (PR = 50), and below average
in spelling (PR = 45). Comparisons of this sort cannot be
made from raw scores. One disadvantage of the PR scale is the
fact that units are not equal at the extremes of the scale. When
PR’s are below 20 or above 80, they must be compared (or
combined) with caution (see refs.).


FIGURE 2-10 Cumulative Frequency Polygon (Ogive) of 180
Scores Achieved on a Clerical Aptitude Test
(X-axis: score scale, 170.5 to 195.5)


FIGURE 2-11 Profile of the Percentile Ranks in Various Subjects
for a Given Child


Sigma-scores and Standard Scores 


We have seen that one way of converting raw scores into
a scale is by means of percentile ranks. Another method of
scaling is to express the deviation of each test score from the
common mean in units of SD, thus putting all scores into σ-units.
Such “deviation” scores are called σ-scores and sometimes
z-scores. The following is an illustration of the method of
converting obtained scores into σ-scores.

Table 2-4 gives the M’s and σ’s earned by fifty sixth grade
pupils on five objective educational achievement tests. At the
bottom of the table are listed the scores achieved by two children,
Mary and Howard.


TABLE 2-4

M’s and σ’s Earned on Five Objective Tests of Educational
Achievement Given in the Sixth Grade

                   (1) Arith.   (2) Arith.
                     Reas.        Comp.      (3) Reading   (4) Grammar   (5) Science
Mean                  62           124           43            28
σ                     10            20            7             4
Mary’s scores         57           119           50            31
Howard’s scores       62           144                         26


From an inspection of these scores, it is clear that Mary is
below the class mean in arithmetical reasoning, arithmetical
computation, and science, but is above the mean in reading and
grammar. Howard, on the other hand, is exactly on the mean in
arithmetical reasoning, above the mean in computation and science,
and slightly below the mean in reading and grammar. These
comparisons are useful, but because of differences in the units in which
test scores are expressed, we cannot (1) compare Mary’s and
Howard’s scores in the several tests, except to point out that
they are above or below the mean, nor (2) combine either pupil’s
scores into a single meaningful index of academic achievement.
Conversion of test scores into σ-units will permit us to carry
out both these operations.

The formula for a σ-score or z-score is

$$z = \frac{X - M}{\sigma} = \frac{x}{\sigma}$$


where (X − M) = x. Mary earned a score of 57 in arithmetical
reasoning. This score deviates −5 points from the mean
(57 − 62 = −5). If we divide this deviation of −5 by 10 (the
σ), we have −.50 as Mary’s σ-score in arithmetical reasoning.
In Test 2, arithmetical computation, Mary’s σ-score is
(119 − 124)/20 or −.25. Her other σ-scores are computed in the same
way; those that are plus are above the mean, those minus below
the mean. Mary’s five σ-scores are shown below:


Test:                (1)     (2)     (3)     (4)     (5)
Mary’s σ-scores     −.50    −.25    1.00     .75   −1.25


Howard’s σ-scores are found as were Mary’s. In Test 1, his
score of 62 is exactly on the mean, and his σ-score is .00. In
Test 2, arithmetical computation, Howard’s σ-score is
(144 − 124)/20 or 1.00. Howard’s scores are below the mean in Tests
3 and 4, and his σ-scores are minus. His scores are tabulated
below:

Test:                (1)     (2)     (3)     (4)     (5)
Howard’s σ-scores    .00    1.00    −.29    −.50     .38
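
The arithmetic of the σ-score can be checked with a few lines of Python. This is a minimal sketch, not part of the original text; the function name `sigma_score` is hypothetical. Only Tests 1, 2, and 4 are used, since their means and σ’s are quoted in the worked examples of this section. Mary’s Test 4 score of 31 is taken from Table 2-4 and is consistent with her σ-score of .75.

```python
def sigma_score(x, mean, sd):
    """Return the sigma-score (z-score): the deviation of x from the
    mean, expressed in units of the standard deviation."""
    return (x - mean) / sd


# (mean, SD) for Tests 1, 2, and 4, as quoted in the worked examples.
tests = {1: (62, 10), 2: (124, 20), 4: (28, 4)}

mary = {1: 57, 2: 119, 4: 31}     # Test 4 score read from Table 2-4
howard = {1: 62, 2: 144, 4: 26}

mary_z = {t: sigma_score(x, *tests[t]) for t, x in mary.items()}
howard_z = {t: sigma_score(x, *tests[t]) for t, x in howard.items()}

for name, z in (("Mary", mary_z), ("Howard", howard_z)):
    print(name, {t: f"{value:+.2f}" for t, value in z.items()})
```

Because each σ-score is a pure number, Mary’s −.50 in Test 1 and Howard’s −.50 in Test 4 describe the same relative standing even though the tests are scored in different units.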


It is apparent that σ-scores are simply plus or minus deviations
from the test mean expressed in σ units. A practical disadvantage
of such scores is the fact that they are small decimal fractions
and are about as often + as −. For greater convenience,
therefore, σ-scores are usually converted into a distribution of
standard scores with an assigned M and σ. M’s and σ’s often
selected are M = 100, σ = 20; M = 500, σ = 100; M = 10,
σ = 3.


If Mary’s and Howard’s scores are converted into a standard
score distribution with M = 100 and σ = 20, we have the
following:

Tests                        (1)    (2)    (3)    (4)    (5)    Total   Mean
Mary’s standard scores        90     95    120    115     75     495      99
Howard’s standard scores     100    120     94     90    108     512     102


In the first test, Mary’s σ-score is −.50, or .50 of σ below the
M. In our new distribution (M = 100, σ = 20), the equivalent
standard score is one-half of σ below 100, or 90. In Test 2,
Mary’s standard score is ¼ of σ below the mean of 100, or 95
(¼ of 20 is 5). A formula for converting obtained scores directly
into standard scores with a M = 100 and a σ = 20 is the
following:


$$X' = \frac{20}{\sigma}(X - M) + 100$$

in which
X’ = standard score in the “new” distribution
X = original or raw score
M = mean of the raw score distribution
σ = SD of the original or raw scores
100 and 20 are the M and σ of the new distribution

Substituting for Mary’s raw score of 50 in Test 3, we have

X’ = 20/7 (50 − 43) + 100
   = 120

Howard’s standard scores in the new distribution are found
from the same formula. In Test 4, for example, Howard’s
obtained score is 26 and from the formula we have

X’ = 20/4 (26 − 28) + 100
   = −10 + 100 or 90




The formula will convert any pupil’s raw scores into standard
scores when the M of the standard score distribution is 100 and
σ is 20.

When put in standard score form, Mary’s and Howard’s scores
can be compared directly; and the five scores of each child can be
combined with equal weights. On the five tests, Mary’s average
is 99 and Howard’s 102.
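
The conversion to standard scores and the combination into a total and a mean can be sketched as follows. This Python fragment is an illustration only, not part of the original text; the function names `standard_score` and `profile` are hypothetical. It starts from the σ-scores already tabulated, takes Mary’s science σ-score as −1.25 (the value implied by her standard score of 75), and rounds each standard score to a whole number before adding, which appears to be what the table does.

```python
def standard_score(z, new_mean=100, new_sd=20):
    """Place a sigma-score in a distribution with the assigned M and SD."""
    return new_mean + new_sd * z


def profile(z_scores):
    """Return (rounded standard scores, total, mean) for one pupil."""
    scores = [round(standard_score(z)) for z in z_scores]
    total = sum(scores)
    return scores, total, total / len(scores)


mary_z = [-0.50, -0.25, 1.00, 0.75, -1.25]
howard_z = [0.00, 1.00, -0.29, -0.50, 0.38]

mary_scores, mary_total, mary_mean = profile(mary_z)
howard_scores, howard_total, howard_mean = profile(howard_z)

print("Mary:  ", mary_scores, mary_total, round(mary_mean))
print("Howard:", howard_scores, howard_total, round(howard_mean))
```

Adding the five standard scores weights each test equally, because every test now has the same σ of 20. Adding raw scores would instead let the test with the largest σ dominate the total.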


The IQ as a Standard Score 

When raw scores are converted into standard scores in a
distribution with a mean of 100 and a σ of 15, these new scores
are often called “deviation IQ’s” (page 36). In the Wechsler-
Bellevue Intelligence Test, for example, IQ’s are determined by
this method. A Wechsler-Bellevue IQ of 115 is 1σ above the
mean of the group; an IQ of 85 is 1σ below the mean of
the group.

A general formula for transforming obtained scores into
standard scores with any given mean and σ is

$$X' = \frac{\sigma'}{\sigma}(X - M) + M'$$

where
X’ = standard score in new distribution
X = obtained score (usually in points)
σ’ = SD of the new distribution
σ = SD of obtained score distribution
M’ = M of standard score distribution
M = M of raw score distribution
This formula may be used to compute deviation IQ’s. Suppose
that Arthur J., a veteran 32 years old, earns a score of 86 on an
intelligence test for which the mean of his age-group is 80 and
the σ is 10. What is Arthur J.’s deviation IQ? Substituting in the
formula, we have

X’ = 15/10 (86 − 80) + 100
   = 109
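
The general formula is a single linear transformation, which a short Python sketch makes plain. The function name `convert_score` and its parameter names are hypothetical, chosen for this illustration; the figures are those of the worked examples in this chapter.

```python
def convert_score(x, mean, sd, new_mean, new_sd):
    """Transform an obtained score into a standard score.

    x, mean, sd      : obtained score, M and SD of the raw distribution
    new_mean, new_sd : assigned M and SD of the standard score distribution
    """
    return (x - mean) / sd * new_sd + new_mean


# Arthur J.: raw score 86, age-group M = 80, SD = 10, deviation-IQ scale.
arthur_iq = convert_score(86, mean=80, sd=10, new_mean=100, new_sd=15)

# Howard's Test 4 score on the M = 100, SD = 20 scale used earlier.
howard_test4 = convert_score(26, mean=28, sd=4, new_mean=100, new_sd=20)

print(round(arthur_iq), round(howard_test4))
```

Setting `new_mean=0` and `new_sd=1` gives the σ-score itself, so the σ-score, the standard score, and the deviation IQ differ only in the M and σ assigned to the new scale.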




The formula is useful when we wish to convert the sub-tests of
a battery into comparable units which may be combined into a
single score.


Normalized Standard Scores, or T-Scores 


When raw point scores are transformed into PR’s and the
resulting PR’s are converted into equivalent “scores” in a normal
distribution, the final scores are said to be “normalized.” If the
normal scaling distribution into which the scores are converted
has an M = 50 and σ = 10, the normalized scores are called
T-scores. Converting raw scores into T-scores can be easily
done with the aid of tables prepared for this purpose. First, the
PR’s of the scores (or of the midpoints of the successive
intervals) are read from an ogive. The T-scores (normalized scores)
corresponding to these PR’s are then read from tables. T-scores
range theoretically from 0 to 100, practically from about 15 to
85. The method of computing T-scores for a given distribution
will be found in detail in the references.

For several reasons, T-scaling is theoretically the soundest
method of converting raw scores into an equal-unit scale. Many
of the widely used educational achievement tests make use of
some variety of T-scaling. T-scores can be added or averaged;
they have the same meaning and denote the same relative
achievement.
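
The table lookup described above can be imitated with the inverse of the normal curve. The Python sketch below is an illustration under stated assumptions, not the procedure of any particular test manual. It assumes the conventional T-scale with M = 50 and σ = 10 (the scale implied by the theoretical range of 0 to 100), and it uses `statistics.NormalDist` from the standard library in place of printed tables. The function name `t_score` is hypothetical.

```python
from statistics import NormalDist


def t_score(pr):
    """Return the normalized standard score (M = 50, SD = 10) that
    corresponds to a percentile rank on the normal curve."""
    if not 0 < pr < 100:
        raise ValueError("PR must lie strictly between 0 and 100")
    z = NormalDist().inv_cdf(pr / 100)  # sigma-score cutting off PR percent
    return 50 + 10 * z


# PR's of the interval midpoints printed in Table 2-3, lowest first.
for pr in (14, 29, 50, 69, 85, 95, 99):
    print(f"PR {pr:>2} -> T {t_score(pr):.0f}")
```

Equal steps in PR near the middle of the scale correspond to small steps in T, whereas the same steps near the extremes correspond to large ones. This is the unequal-unit property of PR’s that normalizing is meant to correct.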


NORMS 

Norms are scores which are typical or characteristic of pupils
of a given age or grade. To provide comparable norms, scores
on group tests are expressed in PR’s, standard scores, or
normalized scores. Performance tests and individual intelligence
scales have norms expressed in MA’s and IQ’s. Many group
intelligence tests also have their raw scores put into MA and
IQ terms. Such MA’s and IQ’s are rarely comparable to the
MA’s and IQ’s of the Stanford-Binet.

Educational achievement tests usually provide both age and
grade norms. From a table of norms, a teacher can tell whether
her class is up to grade level, and she can tell how individual
pupils in her class stand relative to each other on the sub-tests
of the battery. Suppose that Carl W., age 11-2, and just entering
the sixth grade, earns a score of 18 on an arithmetical problems
test of the Metropolitan Achievement Test. From the table of
norms we find that Carl has a PR of 68 on the test. Furthermore,
we find that his age-equivalent is 12-4 and his grade equivalent
is 6.9. Carl’s score is typical of children about a year older
than he, and his knowledge of arithmetic equals that of children
who are completing the sixth grade. His PR, of course, reflects
performance above the average.

The SRA (Science Research Associates) verbal and non-
verbal tests are group tests of intelligence. Norms are given in
PR’s and IQ’s. If a child achieves a score of 34 on the verbal
section, for example, his PR from the table of norms is 40 and
his IQ (really a standard score) is given as 96. The Stanford
Achievement Test provides age and grade equivalents to obtained
scores. Raw scores from nine sub-tests are converted into an
equal-unit scale, in accordance with which a profile is drawn up
(page 35). Suppose that Louise M., age 12 years and 6 months
and in the last quarter of the seventh grade, earns a raw score of
40 on the science test of the battery. From tables of norms we
find that this score has a grade equivalent of 8.3 and an age
equivalent of 13-4. Thus Louise’s score in science places her
above her age and grade levels. Her PR on the science test is 60.


Many aptitude tests supply scaled score norms for various
groups of workers differing in experience, training, and skill.
Interest inventories are scored so as to reflect an applicant’s in- 
terests in a large number of occupations. Thus if the vocational- 
interest blank is scored with the key for lawyer interests, we can 
tell whether the applicant has the interests of a lawyer and to 
what extent. Scores from personality questionnaires serve to iden- 
tify a subject as “dominant,” “introverted,” or “neurotic” in 
relation to the norms given for these classifications. 




Teachers use test norms for a number of purposes, which will
be elaborated upon in later chapters. Among the more important
objectives, we may list the following.


1. To estimate group achievement. Performance of the class as
a whole can be evaluated against national, state, or local
norms (page 115).

2. To evaluate individual achievement. A pupil’s score on an
educational achievement test is always considered in connection
with his native capacity or mental alertness. A slow or
dull child may be working up to his limit, whereas a bright
child may be performing below expectation.

3. To evaluate family and cultural background. The achievement
of a class or of an individual will always depend on his
socio-economic status, family background, and opportunities.

4. To evaluate the curriculum effects. A pupil’s achievement
must be judged as good or poor in the light of the content,
emphasis, and objectives of the school.

5. To measure individual differences. There are always wide
differences in academic achievement within a group or class.
These differences are due in part to differences in native
ability and in part to differences in environmental
opportunities.


SUGGESTIONS FOR FURTHER READING 

Garrett, H. E. Statistics in Psychology and Education (5th edition).
New York: Longmans, Green, 1958.

Noll, V. H. Introduction to Educational Measurement. Boston:
Houghton Mifflin, 1957. Chapter 3.

Thorndike, R. L., and Hagen, E. Measurement and Evaluation in
Psychology and Education. New York: John Wiley, 1956. Chapter 5.

Ross, C. C., and Stanley, J. C. Measurement in Today’s Schools (3rd
edition). New York: Prentice-Hall, 1954. Part I.


QUESTIONS FOR DISCUSSION 


1. A fifty-item multiple-choice test in science, administered to ninety
pupils, showed scores ranging from 16 to 48. Fifty scores fell between
35 and 48. What would the distribution be like? Would it be skewed?
What measure of central tendency would be most suitable?

2. In question 1, what would you conclude about the suitability of the
test for the group?

3. Explain the implications of each of the following correlation
coefficients:

(a) The correlation between height and score on an arithmetic test
is .04.

(b) Ratings of pupils for social adjustment and aggressiveness show a
correlation of −.65.

(c) The correlation between term grades and scores on a group
intelligence test is .70.

4. Rank the following 23 scores in order of size: 35, 40, 31, 29, 35, 23,
32, 34, 28, 34, 15, 14, 34, 40, 22, 32, 30, 39, 50, 19, 40, 27, 37. Compare the
“mid-score” with the mean.

5. Karl’s PR on a biology test is 48. What does this mean?

6. Margaret has taken five tests. What would be the advantage of
expressing her scores on these tests as PR’s?

7. Given the following:

              Paragraph Reading    Arithmetic
   Mean             51.7              38.6
   σ                 9.2               6.5

William achieves a score of 56 on the first test and 35 on the second.
Convert these raw scores into z-scores.

8. How are age and grade norms obtained? Which is the more useful
in determining placement?

9. Two classes earn about the same mean on a test, but Class A’s SD is
twice the size of Class B’s. What do you conclude from this fact?

10. How would you validate a teacher-made test?


CHAPTER 3 


INDIVIDUAL INTELLIGENCE SCALES 


This chapter will consider four individual intelligence scales or
test batteries.* These are (1) the Stanford-Binet** (1937 or
revised form) designed for children from age 2 through adolescence;
(2) the Wechsler-Bellevue Intelligence Scale, for use primarily
with adults; (3) the Wechsler Intelligence Scale for Children
(WISC); and (4) the Arthur Performance Scale, useful from
about age 4 to maturity. These four scales (one for adults and
three for children) are representative of the best individual
intelligence scales now available. They are carefully constructed,
widely used, valid and dependable. Ordinarily, the individual
intelligence test will not be administered by the classroom
teacher. But the teacher must be familiar with the make-up of
these scales and with their role in the school program if he is to
make good use of the test findings.

* A test battery is a group of carefully selected tests designed to
operate as a team.

** The full name is Stanford Revision of the Binet-Simon Scale.

The individual intelligence examination should not be
administered by a novice. To give such a test, and more important to
interpret it, requires special training in mental measurement
and in clinical psychology, plus a sound knowledge of
psychological theory. In addition, at least six months should be spent in
giving and scoring these tests under supervision, if one is to have
a minimum of “clinical experience.” Unfortunately, perhaps,
directions and materials for giving the individual scales are
readily available in the manuals, and the beginner is tempted to
try his hand at administering the tests. Much undeserved criticism
of the individual intelligence test, and of the MA and IQ, has
arisen from the faulty administration and interpretation of these
scales by the unskilled amateur.


The Concept of General Intelligence 


Before examining the individual intelligence scales in detail, 
we must get a clearer notion of what the tests are attempting to 
measure. This means that we must formulate a definition of what
is meant by “general intelligence.” 

Definitions of general intelligence have run the gamut from
such comprehensive biological descriptions as adjustment to the
environment to the fairly narrow designation of aptitude for
academic work. The French psychologist Alfred Binet defined
intelligence as (1) the ability to take and maintain a definite
direction, that is, to carry through a course of action once
begun; (2) adaptability to new situations and new requirements;
and (3) the power to evaluate and criticize one’s own acts (not
present in the feebleminded). Other psychologists agreeing in
the main with Binet have stressed adjustment to life and capacity
to learn. In contrast with these broad formulations, Lewis M.
Terman, author of the Stanford-Binet, has defined intelligence
simply as the ability to carry on abstract thinking.

Definitions of general intelligence must of necessity be broad 
when they stress biological adaptability to life. Such definitions 
are hardly incorrect, but neither are they useful. Indeed, any 
attempt to encompass such a comprehensive function as general
adaptability is a well-nigh impossible task. On the other hand, a 
definition of intelligence simply as the ability to do school work 
is certainly too narrow; we should include proficiency in every- 
day activities in business and the professions, where aptitude 
displayed in school finds ready application. 

In order to give greater precision to the concept of intelligence,
the educational psychologist Edward L. Thorndike has suggested
that we recognize at least three broad areas of intelligent
behavior. These “intelligences” he called abstract, mechanical, and
social. Abstract intelligence he defined as the “ability to
understand and manage ideas and symbols, such as words, numbers,
chemical or physical formulas, legal decisions, scientific
principles and the like. . . .” In the case of students, this is very close
to what is called scholastic aptitude. Mechanical intelligence
includes “the ability to learn, to understand and manage things and
mechanisms, such as a knife, a gun, a mowing machine, an
automobile, a boat, a lathe. . . .” Social intelligence is “the ability to
understand and manage men and women, boys and girls, to act
wisely in human relations.” We should expect to find high
abstract intelligence in scholars, scientists, executives in business
and government; high mechanical intelligence in mechanics,
builders, expert carpenters and plumbers; and high social
intelligence in politicians, salespeople, leaders in society.
Presumably the successful civil engineer possesses high abstract as well
as high mechanical intelligence; the successful criminal lawyer
abstract as well as social intelligence; the machinery salesman
mechanical and social intelligence. These “intelligences” are
positively, but not always highly, correlated. Hence, a high level
of one “intelligence” may accompany a fairly low degree of
another. A nuclear physicist (high in abstract intelligence) may
be socially inept. And the man successful in business or politics
may be mediocre in mechanical skills. Perhaps the able
jack-of-all-trades can be expected to rate well, but not necessarily very
high, in all three areas.

On examining the individual intelligence test, we find that it 
presents a variety of problems which demand the ability to utilize 
ideas and symbols, for example, words, numbers, diagrams, 
pictures, geometrical figures. When used with young children, 
general intelligence tests are primarily measures of mental alert- 
ness on the abstract level. For adults, these tests are measures of 
the aptitude for such occupational and other tasks as draw upon 
abilities operative in school work. In short, the individual intel- 
ligence test measures abstract or scholastic ability primarily and 
is rarely a gauge of mechanical aptitude or of social competence. 
The evidence for this view comes from an analysis of the tests 
themselves, as well as from many studies in which individual
intelligence tests have been used.


THE STANFORD-BINET INTELLIGENCE 
SCALE (1937 REVISION) 


Because of the time required to administer the Stanford-Binet
(in most cases forty minutes to an hour) and the training
demanded of the examiner, this test is rarely given routinely in
most schools. The classroom teacher must be generally familiar
with the Stanford-Binet, however, in order to know what can
be expected of it, that is, how it might add to her knowledge
of a given pupil. The Stanford-Binet is a valuable supplement to
a group intelligence test or to an educational achievement
examination when (1) a child has a severe reading disability or
some physical handicap (for example, in sight, hearing, or
muscular co-ordination); (2) when a pupil exhibits marked emotional
stress or emotional disturbance; and (3) when other test results
or school marks do not jibe with the teacher’s estimate of the
pupil’s ability. For purposes of routine classification and
placement, the group intelligence test is about as satisfactory as the
Stanford-Binet, but the latter will provide a more accurate,
detailed, and comprehensive appraisal of intellectual level and is
more useful in diagnosis and prediction.


Description. The 1937 edition of the Stanford-Binet represents
a careful and thorough re-working of the earlier 1916 scale. The
number of test items was increased from 90 to 129 and the scale
extended down to lower age levels and much strengthened at
upper age levels. Two equivalent forms of the scale, called L and
M, were constructed. Table 3-1 contains a selection of the items
at different age levels. Note that at the lower age levels, such as
IV, the test situations make use of objects and pictures and require
that the child understand and carry out oral directions. At the
upper age levels, X, XIV, and Average Adult, the test items are
more abstract and bookish; the problems require verbal and
numerical manipulation, reasoning, logical selection and choice,
and good judgment. Memory for numbers and for sentences
recurs throughout the scale. Questions dealing with specific facts
learned in and out of school are excluded, but many common-
knowledge questions are included on the reasonable assumption
that what a person has learned in everyday living is a good index
of what he can learn (and will learn) later on. Some Stanford-
Binet test materials are shown in Figure 3-1 (facing page 54).


TABLE 3-1 Illustrative Tests from Stanford-Binet Scale
for Years IV, X, XIV and Average Adult

Year IV
1. Picture Vocabulary            Child must recognize and name everyday objects seen in the pictures.
2. Naming Objects from Memory    Child is shown small toys representing common objects. These he names, or they are named for him. Later he must recall from memory the name of each object.
3. Picture Completion            Child must finish the incompleted drawing of a man.
4. Pictorial Identification      Pictures of objects on cards to be identified.
5. Discrimination of Forms       Recognition and identification of simple geometrical forms.
6. Comprehension                 Sensible answers to “why” questions.
Alternate: Memory for Sentences  Repetition of short sentences read aloud to the child.

Year X
1. Vocabulary                    The examinee must give definitions of eleven words in a standard vocabulary list.
2. Picture Absurdities II        Must recognize what is “foolish” in a presented picture.
3. Reading and Report            Reads a selection and reports from memory what is read.
4. Finding Reasons               Gives sensible reasons to explain cause-and-effect relations in familiar situations.
5. Word Naming                   Names as many words as he can in one minute: a measure of word fluency.
6. Repeating Six Digits          The lists are read aloud at the rate of about one a second.

Year XIV
1. Vocabulary                    Larger vocabulary required than at Year X.
2. Induction                     Tests ability to grasp and apply a general rule.
3. Picture Absurdities III       Must recognize what is “foolish” in a picture; more difficult than at Year X.
4. Ingenuity                     Tests ability to solve problems mentally.
5. Orientation: Direction I      Must be able to solve problems involving space relations by following fairly complex directions.
6. Abstract Words II             Must define words like “loyalty” and “justice.”

Average Adult
1. Vocabulary                    Larger vocabulary than at Year XIV.
2. Codes                         Must learn two codes and write messages in them.
3. Differences Between Abstract Words   Tests ability to generalize; makes use of fairly difficult concepts.
4. Arithmetical Reasoning        Requires solution of mental arithmetic problems.
5. Proverbs                      Interpretation of proverbs and fables.
6. Ingenuity                     Solution of problems requiring “mental manipulation.”
7. Memory for Sentences          Tests ability to reproduce rather long and involved sentences heard once.
8. Reconciliation of Opposites   Must tell how words denoting opposite states are alike. Tests ability to grasp abstract relations.


Scope. The placement of test items at a given age level was 
made to depend on the responses of one hundred children at each
age level below 6, two hundred children at ages 6 to 14 inclusive, 
and one hundred children at ages 15 through 18. In all, about 
3,000 children constituted the standardization group. 

Terman and his co-workers selected children whose parents 
constituted a good cross section of occupational levels in the 
United States for the year 1930. The Stanford-Binet, like the 
original Binet Scale, is an age-scale. (See page 32.) It begins at 
2 years and items are grouped at one-half year intervals (at 2,
2½, 3, 3½, 4, 4½) up to 5 years. Mental growth at the lower
age levels is so rapid that the authors of the scale thought it wise 
to narrow the gaps between age levels over this range. From 5 
years to 14, test items are grouped by year intervals; and beyond 
14 there is an average adult level and three superior adult levels. 
The Stanford-Binet is most useful over the age range from about 
6 to 14—that is, over the elementary grades. 


Scoring. The Stanford-Binet assigns a mental age (MA) to a
child in accordance with his ability to progress up the age scale.
As shown in the examples below, two children may earn the
same MA on the Stanford-Binet in different ways.

James, who is 9-3 or 111 months old, earns an MA of 8-10, or
106 months, by scattering his answers up the scale from age
VII to age XIII. Robert also earns an MA of 106 months, but
does not scatter as much as James. MA is a measure of mental
maturity or status. Children differ in the way in which they
answer the test items, but by and large a child comes out with an
MA which indicates his ability to perform mental-manipulative
tasks like those of the Scale.


Test Record of James Brown, chronological age 9-3,
or 111 months*

              Tests Passed    Months Credit     Total Credit
Year Level    and Failed      Per Test          Years   Months
VII           all passed                          7
VIII          four passed     2 months                     8
IX            three passed    2 months                     6
X             two passed      2 months                     4
XI            one passed      2 months                     2
XII           one passed      2 months                     2
XIII          all failed                                   0
                                                  7       22

* The expression 9-3 means 9 years and 3 months.

MA = 8-10 or 106 months
James’ IQ = MA/CA × 100 = 106/111 × 100 = 95


Test Record of Robert Green, chronological age 8-4,
or 100 months

              Tests Passed    Months Credit     Total Credit
Year Level    and Failed      Per Test          Years   Months
VII           all passed                          7
VIII          five passed     2 months                    10
IX            four passed     2 months                     8
X             two passed      2 months                     4
XI            all failed                                   0
                                                  7       22

MA = 8-10 or 106 months
Robert’s IQ = 106/100 × 100 = 106 (decimal dropped)

The intelligence quotient, or IQ, is found by dividing the
child’s MA by his CA (chronological age) and is a measure of
brightness or dullness. James has an IQ of 106/111, or 95, and
Robert, who is 11 months younger, has an IQ of 106/100, or 106.
Both boys have the same mental maturity, but Robert is brighter
than James because he has reached the maturity level of 8-10
at an earlier age. The two measures, MA and IQ, are
complementary, each providing distinctive information. A child of 8
and a man of 40 may each earn an MA of 8 years on the Stanford-
Binet (be of the same mental status in terms of the tests). But
the child has an IQ of 100 (8/8) and is normal, whereas the man
is feebleminded, with an IQ of approximately 53. (Read from the
tables in the Manual.) *
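
The tally of a test record and the computation of the ratio IQ can be sketched in Python as follows. This is an illustration only, not a scoring procedure from the Manual. The function names `mental_age_months` and `ratio_iq` are hypothetical, and the decimal is dropped, as in the two records for James and Robert.

```python
def mental_age_months(years_fully_passed, credits):
    """Return MA in months.

    years_fully_passed : highest year level at which all tests are passed
    credits            : (tests passed, months credit per test) for each
                         higher year level attempted
    """
    return years_fully_passed * 12 + sum(n * m for n, m in credits)


def ratio_iq(ma_months, ca_months):
    """IQ = MA / CA x 100, with the decimal dropped."""
    return (100 * ma_months) // ca_months


# James Brown: all passed at VII, then 4, 3, 2, 1, 1, 0 tests passed.
james_ma = mental_age_months(7, [(4, 2), (3, 2), (2, 2), (1, 2), (1, 2), (0, 2)])
# Robert Green: all passed at VII, then 5, 4, 2, 0 tests passed.
robert_ma = mental_age_months(7, [(5, 2), (4, 2), (2, 2), (0, 2)])

james_iq = ratio_iq(james_ma, ca_months=111)    # CA 9-3
robert_iq = ratio_iq(robert_ma, ca_months=100)  # CA 8-4

print(james_ma, james_iq)
print(robert_ma, robert_iq)
```

The two boys reach the same MA by different routes, and the IQ differs only because the divisor, the CA, differs.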

The IQ is a developmental ratio which inevitably loses its 
value as a child grows older and mental maturity is approached. 
There is little difference in mean performance on the Stanford- 
Binet at ages 15, 18, and 20, and a correction table is provided 
in the Manual which adjusts the CA divisors in order to make the 
older person’s IQ comparable to that of the child. There is no 
specific age at which intelligence can be said to “mature” or 
reach its peak, but 15 is taken somewhat arbitrarily to be the MA 
of the average adult on Stanford-Binet. For any person over 16, 
therefore, the corrected divisor is 15 years. The highest MA 
which can be earned on the Stanford-Binet by passing all of the 
tests in the Scale is 22¾ years. This MA yields a maximum IQ
for adults of 152, found by dividing 273 months by 180 months
(that is, 15 years).
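
The effect of the fixed adult divisor can be shown with a short Python sketch. It is a simplification offered for illustration: it applies the 15-year divisor to any adult, and it does not reproduce the graduated adjustments of the correction table in the Manual. The names `ADULT_DIVISOR_MONTHS` and `adult_iq` are hypothetical.

```python
ADULT_DIVISOR_MONTHS = 15 * 12  # corrected CA divisor for persons over 16


def adult_iq(ma_months):
    """Return MA / 15 years x 100, unrounded."""
    return 100 * ma_months / ADULT_DIVISOR_MONTHS


# A man of 40 with an MA of 8 years (96 months).
print(f"{adult_iq(8 * 12):.1f}")

# The highest MA obtainable on the Scale, 22 3/4 years (273 months).
print(f"{adult_iq(273):.1f}")
```

Without the correction, the man of 40 would be assigned an IQ of 96/480 × 100 = 20, a figure that reflects only the fact that CA keeps increasing after mental growth has levelled off.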


STANFORD-BINET IN THE SCHOOLS 


The evaluation of pupils from their school grades or from sub- 
jective impressions of cleverness or brightness is often quite 
misleading. A teacher may describe a conscientious, amiable girl 
of ten who is one year overage for grade as “bright” when her 
IQ turns out to be relatively modest. Contrariwise, a rude, in- 
attentive youngster may be rated as “about average” or even 
below average when his IQ is in reality considerably above 
normal. Judgments of intelligence are always influenced by 
personality traits and social behaviors. It is not strange, there- 
fore, to find that two pupils must in general differ by as much 
as twenty points of IQ before a teacher is forced to lay other
criteria aside and admit that the badly behaved youngster is
brighter than the courteous, hardworking one.

* See Terman, L. M., and Merrill, M. A. Measuring Intelligence. New York:
Houghton Mifflin Co., 1937. Tables, pp. 415-450.

Teachers should know certain facts about the IQ, what it is 
and how best it can be used, in order to make maximum use of 
the information provided by the test. More specifically, the 
classroom teacher should know (1) the range of IQ’s to be
expected in the school population, (2) the dependence to be placed
on the IQ as a measure of intelligence, (3) to what extent the 
test has diagnostic value, and (4) the limitations of the IQ and 
the precautions to be observed in making interpretations based 
on it. These topics will be considered in the following sections. 


Range of IQ's in the School Population 


The frequency polygon in Figure 3-2 shows the distribution 
or spread of IQ’s for the nearly 3,000 children from 2 to 18 
years old who made up the standardization sample. The fre- 
quency polygon is close to the normal curve model (page 17). 
IQ’s center at 100 and range about equally above and below this


FIGURE 3-2 Distribution of IQ’s on the Stanford-Binet Scale
for Nearly 3,000 Children, 2-18 Years Old


[Frequency polygon: Ages 2 to 18, N = 2904, σ = 16.4; IQ in ten-point class intervals up to 165-174.]


From Terman, Lewis M., and Merrill, Maud A., Measuring
Intelligence. Reproduced by permission of Houghton Mifflin
Company.




value. The σ of the IQ distribution is about sixteen points
(exactly 16.4). This means that the middle 2/3 of school children
will earn IQ’s between 84 and 116. About 1/6 of the
children will have IQ’s above 116 and 1/6 will have IQ’s below
84. See Figure 2-7. The percentage of school children who can
be expected to occupy the different IQ levels may be summarized
as follows:


TABLE 3-2 


Numbers of Children in the School Population to Be
Expected at Various IQ Levels


                                              Percent of Children
IQ Level          Description                 in Each Category


130 and above     Superior or gifted          3 - 5
110 - 129         Above average to high       25
90 - 109          Average or normal           45 - 50
70 - 89           Low normal to dull          20 - 25
Below 70          Dull to feebleminded        1 - 3


The number of children found in any group (especially in the 
two extreme groups) will vary somewhat with the social and 
economic conditions of the community and with the standards 
set up for defining the different intelligence levels. 
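
The "middle 2/3" statement and the percentages in Table 3-2 can be compared with what a normal curve with mean 100 and σ = 16.4 would give. The sketch below assumes an exactly normal distribution (the text says only that the polygon is close to the normal model) and treats the band limits 70, 90, 110 and 130 as sharp cut points; the function names are illustrative.

```python
import math

MEAN, SIGMA = 100.0, 16.4  # values quoted in the text

def normal_cdf(x, mean=MEAN, sd=SIGMA):
    """Proportion of a normal distribution lying below x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def share_between(lo, hi):
    """Proportion of cases with lo <= IQ < hi under the normal model."""
    return normal_cdf(hi) - normal_cdf(lo)

middle = share_between(84, 116)  # within about one sigma of the mean

bands = {
    "130 and above": 1.0 - normal_cdf(130),
    "110 - 129": share_between(110, 130),
    "90 - 109": share_between(90, 110),
    "70 - 89": share_between(70, 90),
    "Below 70": normal_cdf(70),
}

print(f"84 to 116: {100 * middle:.1f}%")
for label, share in bands.items():
    print(f"{label:>14}: {100 * share:.1f}%")
```

Under these assumptions the model percentages come out broadly in line with the ranges in Table 3-2; as the text notes, actual school populations will vary with the community and with the definitions used.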

The IQ is useful in setting educational expectations. Suppose 
that William Butler, a fifth-grade pupil in a large school system, 
has a chronological age of 10-2 and a Stanford-Binet IQ of 116.
William reads at fifth-grade level, is somewhat above average in
his other subjects, and is excellent in arithmetic. He is a quiet,
well-behaved boy who seldom becomes angry or annoyed.
William makes friends readily and is accepted as a member of 
his group. What are William’s educational expectations? 

Table 3-3 will be of help in answering this question. William 
falls in the upper 16 per cent of school children. He should have 
no trouble completing elementary and high school. If he is industrious,
emotionally stable, and has intellectual interests,
William may be encouraged to go to college. It might be wise


FIGURE 3-1 Test Materials Used in the Stanford-Binet Scale 


Reproduced through the courtesy of C. H. Stoelting and McGraw-Hill. 


FIGURE 3-4 Items from the Performance Part of the Wechsler
Adult Intelligence Scale


Reproduced by permission of The Psychological Corporation.


FIGURE 3-5 Performance Tests Found in the [...]


Reproduced by permission of The Psychological Corporation.


to advise a college of not too high standards if William is lacking 
in self-confidence, and to enlist his parents’ enthusiastic support. 

Mary, age 12-0 with an IQ of 83, presents a very different
picture from that of William. Mary is doing barely passing work 
in the fifth grade, though she is about two years overage for the 
grade. Since her MA is not more than 10 years,* she is perhaps 
doing all we can reasonably expect of her. It would be manifestly 


* Since MA/CA = IQ, Mary’s MA is 12 × .83, or about 10 years.


TABLE 3-3


Educational Expectation in Relation to IQ Level


IQ Level
(Stanford-Binet)   Educational Expectation


120+               Can do acceptable work in a first-class college if properly
                   motivated.

115 - 119          Should do acceptable but not outstanding college work.
                   Would probably do best in a small college where the
                   work is individual and standards not too high.

105 - 114          Should complete high school, and may do well in the less
                   difficult college courses. Will have trouble with science
                   and mathematics.

90 - 104           This group constitutes about 50% of the elementary
                   school population. If not retarded by illness or other
                   causes should complete the eighth grade on schedule.
                   Some of these pupils will do fairly well in high school.

80 - 89            Usually one to two years over age for grade. Acceptable
                   high school work very unlikely for IQ’s below 90. A
                   child of IQ 80 will complete the eighth grade, if at all,
                   two to three years behind schedule.

75 - 79            These children may reach the fifth grade. Will rarely
                   go beyond unless given much individual attention.

Below 75           If one of these children reaches the fifth grade he will be
                   14-15 years old. Unable to do fifth-grade work; but because
                   of chronological age is likely to be pushed ahead
                   after repeating each grade two or three times. May be
                   promoted because of age far beyond his mental capacity.




unfair to scold Mary and insist that she “try harder.” Mary’s 
educational expectation (see Table 3-3) is no higher than the 
eighth grade, if that. 
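
The two cases above rest on the relation MA = CA × IQ/100 and on a lookup in Table 3-3. A minimal sketch of both steps follows; the function names are illustrative assumptions, and the one-line expectations are paraphrased summaries of the table rows, not the table's full wording.

```python
def mental_age(ca_years, ca_months, iq):
    """Return MA in years from CA (years, months) and a ratio IQ."""
    ca = ca_years + ca_months / 12
    return ca * iq / 100

# Lower bound of each IQ band in Table 3-3, with a paraphrased summary.
TABLE_3_3 = [
    (120, "acceptable work in a first-class college if motivated"),
    (115, "acceptable but not outstanding college work"),
    (105, "should complete high school"),
    (90, "eighth grade on schedule; some do fairly well in high school"),
    (80, "one to two years over age; high school work very unlikely"),
    (75, "may reach the fifth grade"),
    (0, "unable to do fifth-grade work"),
]

def expectation(iq):
    """Return the summary for the first band whose lower bound iq reaches."""
    for floor, text in TABLE_3_3:
        if iq >= floor:
            return text
    raise ValueError("IQ must be non-negative")

for name, years, months, iq in [("William", 10, 2, 116), ("Mary", 12, 0, 83)]:
    ma = mental_age(years, months, iq)
    print(f"{name}: MA about {ma:.1f} years; {expectation(iq)}")
```

A lookup of this kind only restates the table; the text's cautions about motivation, stability, and home support still govern any individual judgment.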


Stability of the IQ 


When a second form of the Stanford-Binet is administered to 
a child, this second IQ will often vary somewhat up or down 
from the first determination. Norman’s IQ, for example, may 
be 109 today, whereas it was 112 six months ago, and may very 
well be back to 106 six months hence. The stability of a test 
score when the test is repeated or another form given, is called 
the reliability of the test (page 28). Stanford-Binet is one of 
our most dependable mental examinations, with reliability co- 
efficients which are usually well over .90 (page 29). Despite 
this fact, fluctuations in individual IQ’s can still be expected 
when the test is repeated or a second form administered. 

The reliability of a test is conveniently expressed by the stand- 
ard error (SE) of a test score (page 30). The SE gives the allow- 
able (one might almost say the inevitable) changes to be expected 
when a second form of the test is given. The SE of the Stanford- 
Binet IQ is four to five points* for IQ’s between 90 and 110. 
The SE is slightly higher for high IQ’s and somewhat lower for 
low IQ's. Expressed in terms of chances or probability of change, 
a SE of five points means that the odds are roughly 2:1 that an 
IQ of 102, for example, will not be higher than 107 (102 + 5) 
nor lower than 97 (102 — 5) on retest. The SE represents the 
amount of fluctuation to be expected in most cases. The change 
in a few individual cases may be somewhat greater than five 
points or somewhat less than five points. Fluctuations in IQ from 
time to time arise from many causes: changes in the testing 
situation and changes in the child being tested. When a child’s 
mental or physical health or his home or school environment 
change radically between tests, fluctuations in IQ can be ex- 


* When $SE_{IQ} = 16\sqrt{1 - .90}$, we have 5 as the approximate value of the SE.
This is a slight overestimation, as the reliability coefficient is usually above .90.




pected. Mental measurement is never as precise as physical meas- 
urement: a child is a much more variable “object” than, say, a 
piece of metal. Changes in IQ from one test to another rarely 
shift a child from one classification to another, however (see 
Table 3-3)—that is, from normal to superior or from dull to 
normal. The consensus, in fact, is that the IQ is extremely hard to 
change and that we can accept an IQ when expertly determined
as a reliable appraisal of a child’s general mental level. 
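
The standard error used in this section follows from the SD of the IQ distribution and the reliability coefficient, SE = SD × √(1 − r). The sketch below reproduces that arithmetic and the "roughly 2:1" odds; the odds calculation assumes normally distributed errors, which is an assumption added here, and the function names are illustrative.

```python
import math

def standard_error(sd, reliability):
    """SE of a score: SD * sqrt(1 - reliability coefficient)."""
    return sd * math.sqrt(1.0 - reliability)

def retest_band(iq, se):
    """Range of one SE on either side of an obtained IQ."""
    return (iq - se, iq + se)

se = standard_error(16, 0.90)   # the footnote's values
low, high = retest_band(102, 5)  # the text's rounded SE of five points

# Share of a normal curve within one SE, and the odds that follow from it.
inside = math.erf(1.0 / math.sqrt(2.0))
odds = inside / (1.0 - inside)

print(f"SE = {se:.2f}; band for IQ 102 = {low} to {high}; odds about {odds:.1f}:1")
```

Raising the reliability coefficient shrinks the SE, which is why the footnote calls five points a slight overestimate when the coefficient is above .90.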


The Stanford-Binet IQ in Diagnosis of Child Behavior 


Children who achieve the same mental age will differ in IQ
when their CA’s differ (page 51). Furthermore, even when the
IQ is the same, two children may differ sharply in various
aspects of mental development, as shown by the sorts of tests
passed and failed and by the degree of scatter over the scale.
The Stanford-Binet is primarily a standard test-interview designed
to furnish a cross-sectional view of a child’s intellectual
capacities, that is, to give the level at which the child normally
functions. At the same time, the school psychologist, in writing
an account of a child’s performance on the test, will usually
note irregularities in development and learning ability, and these
observations provide the teacher with valuable clues to an
understanding of the child. Visual handicaps, inco-ordinations, and
other physical handicaps may be noted; so also may be noted
deficiencies in arithmetic skills, in word comprehension, in
reasoning, and in current information. The sub-tests of the
Stanford-Binet call for fairly specific performances, and are not
sufficiently numerous or comprehensive to permit the final judgment that
“John is weak in number work, but excellent in rote memory,”
or that “Sarah’s verbal facility far exceeds her manipulative
skills.” But the pattern of a child’s responses and the relative
strengths and weaknesses displayed on groups of items will
provide useful information.
Teachers are often puzzled when a child who is a discipline
problem is, at the same time, described as above normal in
intelligence. The reason, of course, is that Stanford-Binet is not a
measure of social intelligence or of emotional stability but of 
general verbal or abstract level (page 46). At the same time, the 
observant psychologist will note and record characteristic emo- 
tional and temperamental behavior displayed as the child takes 
the test. The rude or indifferent youngster, who doesn’t care and 
who doesn’t co-operate; the spoiled and petulant “brat,” who 
gives up and pouts at the first failure; the timid and insecure 
child, who inquires eagerly “Is that right?” after answering each
item—all these children reveal their distinctive personality traits 
by the manner in which they tackle the test. Standards of be- 
havior in the home, ideals of conduct, values, and attitudes are 
often exhibited clearly, if indirectly, in the course of a mental 
examination. At Year VII, for example, is the question “What's 
the thing to do if another boy (or girl, depending on the sex of 
the examinee) hits you without meaning to do it?” The child 
who is immature socially or reared in [...] community will answer
promptly “Hit him back.” The 7-year-olds who are better trained
in [...] their replies, or [...] how qualitative [...] help the
classroom [...]


Case I. SM, a boy; CA = [...], MA = 8-2, IQ = 80.

[...] teacher because of unsatisfactory [...] is a good-looking,
polite lad, [...] manner. Anyone unacquainted [...] SM to be
average in intelligence. [...] the Stanford-Binet, SM’s [...]
halting, poorly phrased, and [...]. He had inaccurate and meager
responses to seeing-relations items (differences and
similarities). He was




poor in number relations; his co-ordination and rote memory 
were fair. SM is a dull boy who may reach the eighth grade, but 
is not likely to go beyond. It was recommended that SM undertake
vocational training.


Case II. RW, a girl; CA = 11-2, MA = 11-3, IQ = 100. 

RW is a well-developed girl, apparently calm and self-pos- 
sessed. She was referred by her teacher because of poor work in 
the sixth grade; she is described as being inattentive and given 
to daydreaming. RW seemed indifferent to the test, but did not 
refuse to co-operate. She often asked that a question be repeated, 
and the examiner suspects slight deafness. She became more in- 
terested as the test proceeded, especially when she got the answers 
to several questions. Her vocabulary is at Year X, but her verbal
ability is about normal, as shown by her ability to deal with
pictures and verbal absurdities, name words, define abstract terms,
and deal with similarities. Her attention was somewhat variable
and she was easily distracted. She showed uncertainty in using
number relations, as, for example, in making change. RW is
normal in intelligence and should be able to do satisfactory work
in the sixth grade. It is suspected that her daydreaming is, in
part, a consequence of puberty. It was recommended that the
classroom teacher check on RW’s friends, outside activities and
home conditions.


Case III. HP, a boy; CA = 6-5, MA = 9-6, IQ = 148. 

The second grade teacher is not sure what to do with HP; he
seems to know everything she is teaching. HP’s father is a
prominent surgeon. This boy entered school at 6-1 and was put
in the second grade. He is well mannered, normal in play and in
social activities, and gets along well with his classmates. HP
whizzed through the tests for Years VI, VII, and VIII. His
vocabulary is at Year X. He defined an orange as “a citrus fruit,
round and yellow, comes from Florida.” His co-ordination is not
up to his verbal level, but his memory and perception of differences
and likenesses are excellent. HP is a very bright youngster.
He should be ready for high school by age 12 or earlier. He
should now be in the fourth grade, if he is ready for it.
If promotion is not feasible, a program of outside reading and
some special attention is suggested.


Precautions to Be Taken in Interpreting the IQ 


Some of the factors which may influence a child’s IQ have 
been touched upon in preceding paragraphs. To what extent the 
IQ is an index of “innate ability” will depend upon the co-opera- 
tion and motivation of the examinee, and upon how expertly the 
test has been administered. Several conditions which may affect 
the reliability of an IQ are the following: 


Physical causes: Sensory defects, deafness, poor eyesight. Malnutrition
and illness are also important.

Examiner: The personal equation of the examiner may be crucial. Mental
test examiners who are poorly trained, have harsh and unpleasant
voices, peculiarities of manner or dress, or who are supercilious or
arrogant in their relations with the child get poor co-operation and
uncertain test results.

Testing conditions: Test results are likely to be unreliable when the
examination room is bare, too cold or too warm, or overdecorated.
Coaching on the tests must always be watched for, since the tests have
been widely distributed.

Environmental surroundings: The degree of stimulation received in the
home, the school and the community will markedly affect the test
performance. Children from homes broken by divorce or by drunkenness
will often show IQ increases of as much as twenty points after
several months of kind treatment. On the other hand, children from
good homes who have been transferred to a deprived and restrictive
environment (as, for example, in war) may show sharp drops in IQ.

Because of the many factors which may affect its determination,
a Stanford-Binet IQ should not immediately be denounced
as worthless should there be a considerable shift in a second
rating. Instead, a drastic change in IQ should be taken as a
challenge, and the causes ferreted out if possible. The neglected
dull normal child when taken into a good home will often show
an increase in measured IQ, as will a child adopted into a good
family. By the same token, a normal child will do poor school
work if he is insecure and unhappy. It seems very unlikely that
even sharp changes in IQ reflect a real alteration in a child’s 
aptitude. At least all of the environmental factors should be con- 
sidered before this conclusion is reached. 


Constancy of the IQ over the Age Range 


Suppose that Bob White, who is 7 years old, has an MA of 8
years on Stanford-Binet and an IQ of 8/7, or 114. When Bob
is 14 years old, his MA must be 16 years if his IQ is to remain
constant at 114 (16/14 = 114). The IQ is a measure of brightness
or dullness relative to a child’s age group. Hence, should an
IQ fluctuate widely (as, for example, from 114 to 85 or to 140)
the ratio MA/CA becomes valueless. We have said earlier (page
56) that when the IQ of a child has been determined by an
expert, it is a highly dependable index. But whether the IQ
remains constant over the years from 6 to 14 (over the elementary
school, for example) will depend, for one thing, on the way
in which the test has been constructed. This question is appropriate,
therefore: “Is the Stanford-Binet so constructed as to
make a constant IQ probable or even possible?”

There are three conditions which an intelligence test must
meet if the IQ, defined as the ratio, MA/CA,* is to remain
constant over the age-scale. These are:

1. Increased spread of MA’s (larger SD’s) as we go up the
age-scale.

2. Homogeneity of mental function over the age range covered
by the scale. Homogeneity means that the test measures the
same “intelligence,” for example, from age 2 to age 18.

3. Zero correlation between chronological age and IQ.

These conditions are met to a high, though not perfect,
degree by the Stanford-Binet. They are not met, even approximately,
by most group intelligence tests (page 97). Let us
examine each condition further.

1. The SD of the Stanford-Binet MA distributions increases 


* The IQ may also be defined as a standard score (p. 39). The conditions for
IQ constancy, given above, apply only to age-scales.




fairly regularly with chronological age. At Year VII, for example,
the SD of the mental age distribution is 1 year; at Year X
the SD is 1.6 years; at Year XIII it is 2.3 years; and at Year XVI
it is 2.6 years. This means that if Bob White has a CA of 7 years
and is one SD above the mean for his age (that is, at 8 years),
his IQ will be 8/7, or 114. If Bob maintains his rate of mental
growth, at age 10 his MA will be 1 SD above the mean, or at
11.6 years (10 + 1.6). Bob’s IQ is now 11.6/10 or 116. At age
13, should Bob stay 1 SD above the mean, his MA will be 15.3
(13 + 2.3) and his IQ 15.3/13, or about 118. And at age 16, should
Bob maintain his rate of growth, his MA should be 18.6 (16 + 2.6)
and his IQ 18.6/16 or 116. Figure 3-3 shows that when a child maintains
an accelerated rate of growth, his IQ (like Bob’s) will remain
approximately constant, that is, within a few points.


FIGURE 3-3 Age-Progress Curves for the Stanford-Binet Scale 


[Note that the spread of IQ’s becomes greater with increasing
chronological age.]


[Plot: mental age against chronological age (horizontal axis, 4 to 16 years), with age-progress lines marked for IQ 120 and IQ 80.]


Figure 3-3 shows that when a child is below the mean for his
age, his IQ will again remain approximately constant should he
maintain his slower rate of growth. If a child has an MA of 6 and a
CA of 7 (is 1 SD below the mean for his age) his IQ will be
6/7, or about 86. Should this child maintain his slower rate of
growth, at Year XIV his MA will be 11.8 (14 - 2.2) and his IQ
will be 11.8/14, or 84. It is the increasing spread of MA’s with
increasing CA which keeps the ratio MA/CA constant within
a few points, always provided the child maintains a constant rate
of growth. (See Figure 3-3.)

2. Statistical analysis has shown that the correlation between
mean MA levels is very high, and that the Stanford-Binet
is measuring essentially the same “intelligence” as we go up the
age-scale.

3. When a child reaches the upper teens, mental growth
as shown by the Stanford-Binet fails to keep pace with
chronological age. When this happens, the curves in Figure 3-3
lose altitude and bend over to become parallel with the baseline.
Failure of the MA to increase with CA leads inevitably to a falling
IQ among older children and, if uncorrected, to a negative
correlation between CA and IQ. (Negative correlation follows
because the CA continues to increase, whereas the IQ no longer
does; see page 52.) To overcome this fault in the age-scale,
the authors of Stanford-Binet provide a steadily decreasing CA
divisor from age 13 and above. This procedure bolsters up the
IQ by lessening the denominator (CA) and thus balancing the
decreasing numerator (MA). This means that a child’s IQ does
not have to decrease as the child grows older, and that there is
no systematic correlation (positive or negative) between CA
and IQ.
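
The first condition above, a spread of MA's that widens with age, can be illustrated with the SD values quoted for Years VII, X, XIII and XVI. The sketch compares a child who stays one SD above the mean under those widening SD's with the same child under a purely hypothetical scale whose SD stayed fixed at 1 year; the fixed-SD case is an assumption added for contrast and does not come from the text.

```python
# SD of the mental-age distribution (years) at each age, from the text.
SD_BY_AGE = {7: 1.0, 10: 1.6, 13: 2.3, 16: 2.6}

def ratio_iq(ca, ma):
    """Ratio IQ as 100 * MA / CA (ages in years)."""
    return 100 * ma / ca

# Child who stays exactly one SD above the mean at every age.
growing = {ca: ratio_iq(ca, ca + sd) for ca, sd in SD_BY_AGE.items()}

# Hypothetical contrast: the same child if the SD stayed at 1 year.
fixed = {ca: ratio_iq(ca, ca + 1.0) for ca in SD_BY_AGE}

for ca in SD_BY_AGE:
    print(f"age {ca:2d}: widening SD -> {growing[ca]:.1f}, "
          f"fixed SD -> {fixed[ca]:.1f}")
```

With the widening SD's the quotient stays within a few points of its starting value, whereas with a fixed SD it would drift steadily toward 100 even though the child's standing in his age group had not changed.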


THE WECHSLER-BELLEVUE 
INTELLIGENCE SCALE* 


Description. The Stanford-Binet is sometimes used to measure 
the intelligence of adolescents and young adults, but it is not well 


* The Wechsler Adult Intelligence Scale (WAIS), published in 1955, represents
a revision and restandardizing of the Wechsler-Bellevue Intelligence Scale
(W-BIS). WAIS makes use of the same principles of construction, scoring and
IQ derivation found in the older scale, and the two are essentially the same test.
W-BIS is described here rather than WAIS because it is better known and is
still widely used.




suited to these groups, since the items of the test were selected to 
appeal primarily to children. A better examination for measuring 
adult intelligence is the Wechsler-Bellevue Intelligence Scale, an 
individual intelligence test designed especially for adults. The 
Wechsler-Bellevue is, on the whole, a well-made examination. 
The group used in standardizing the test battery (that is, the
group upon whose answers the scoring and norms depend)
consisted of about 1700 persons chosen from a larger group of
3500. The sample was chosen to represent the occupational dis- 
tribution of the adult population at the time of the 1930 census. 
The sample is adequate in size, but the fact that it was drawn 
mostly from New York City and New York State renders ques- 
tionable its claim to represent the country as a whole. 

The Wechsler-Bellevue Scale consists of two parts, a Verbal 
Scale and a Performance Scale. Language is required in the first 
scale, but the tests in the second part demand no language in the 
actual solution of the problems. Directions, however, are given 
orally. What is called the Full Scale is a combination of the 


Verbal and Performance sections. The Verbal Scale is made up 
of five tests, as follows: 


VERBAL SCALE 


1. General Information: Twenty-five questions covering a wide range of
common information and dealing with facts which all normal adults
have presumably had a chance to learn. Questions are graded in
difficulty from easy to hard.

2. General Comprehension: Ten questions and two alternates, in each
of which the examinee is asked to tell what should be done in certain
situations, or why certain practices should be followed. The questions
are planned to measure practical judgment, common sense, and
understanding.

3. Arithmetic Reasoning: Ten mental arithmetic problems. Each problem
is presented orally and must be solved without the use of paper or
pencil (“in the head”).

4. Digits Forward and Backward: Memory span for digits presented one
at a time and ranging in number from 3 to 9. In the second part of
the test, examinee must give the list of numbers in reverse order.

5. Similarities: Twelve word-pairs, each pair alike in some way. The
examinee must say in what way the two words are alike.

6. Vocabulary (Alternate): A list of forty-two words graded in difficulty
to be defined orally.


There are five tests in the Performance Scale, as follows: 


PERFORMANCE SCALE 
1. Picture Completion: Fifteen cards, each containing a picture from
which some part is missing. The examinee must give the missing part.

2. Picture Arrangement: Six sets of pictures, each set containing from
three to six separate pictures. The examinee is to arrange the pictures
in a given set so that they tell a story.

3. Object Assembly: Three form-boards: the Manikin, the Profile, and
the Hand. The parts of each form-board must be put together, much
as in a jigsaw puzzle, to form a complete object.

4. Block Design: Sixteen small cubes (blocks) colored red, white, and
red-and-white on the sides. The blocks are to be arranged to match
seven designs presented on test cards. The designs require from four
to sixteen cubes.

5. Digit Symbol: A well-known association test. Nine numbers are
matched with nine symbols in accordance with a key.

Samples of the items from the performance part of the
Wechsler Adult Intelligence Scale are shown in Figure 3-4
(facing page 54). These tests are “performance” in the sense
that the examinee in solving the problem must make use of
diagrams, pictures, form boards, and cubes. But “ideas” (that
is, symbols) are certainly not excluded. Wechsler’s performance
tests, therefore, are measuring abstract rather than motor or
mechanical intelligence.

Scope. The Wechsler-Bellevue Scale provides scores in the
form of “IQ’s.” Norms run as low as 10 years, but the scale’s
principal application is over the age range from about 20 to 60
years. Beyond 60 years, Wechsler-Bellevue IQ’s are not always
dependable, owing in part to the small samples at advanced age
levels. But these IQ’s may be taken as useful estimates of general
intelligence. Age-level scores on the Full Scale (Verbal +
Performance) show a gradual decline after 20, the drop in score
from age 20 to age 60 being about 20 per cent.

Scoring. Following the directions given in the scoring guide
(Manual), the examiner first adds up the items done correctly
(speed is sometimes a factor) for each of the ten sub-tests. Scores
on each sub-test are then converted into standard scores (page
38), in which the mean for the 20-34 age group is set at 10
and the SD at 3. Conversion of the separate sub-test scores
into a common standard score scale allows the examiner to combine
the tests into a single index and thus to compare the ups and
downs in performance from sub-test to sub-test.
The Wechsler-Bellevue does not provide mental ages, since 
the concept of mental age, though useful with children, has little 
meaning when applied to normal adults. The Scale does provide 
for an IQ (called a “deviation IQ”), which is essentially a stand- 
ard score. There are three IQ’s, one from the Verbal, one from 
the Performance, and one from the Full (combined) Scale. In 
each case these deviation IQ’s are found in the following way:
Scores on the sub-tests (10 for the Full Scale) are added and the
total is converted again into a standard score, this time with a
Mean = 100 and a SD = 15 (page 39). At each age level (for
example, at 30, 40, 50) the mean score got from the sub-tests is
set at IQ 100. A score which is one SD above the mean at any
age level then becomes an IQ of 115. Putting the IQ for each age
level at 100 adjusts for the steady fall in total test score with age.
Standard score IQ’s or “deviation IQ’s” below 100 denote the
same degree of retardation with reference to one’s age group.
For example, we read from the Manual that a man aged 35 who
achieves a score of 75 on the 10 tests of the Full Scale has an IQ
of 92 and so is slightly below the mean of his age group. The same
score of 75 becomes an IQ of 96 at age 45 and an IQ of 100 at
age 60. This means that a total score of 75 on the 10 sub-tests is
“normal” (or “at age”) for age 60 and hence receives an IQ of 
100. But the score of 75 is below the mean for the younger 
groups. Again, the examinee who earns a score of 90 (Full Scale) 
has an IQ of 109 if he is 57, an IQ of 102 if he is 37, and an IQ 
of 94 if he is 22 years old. 

To summarize, Wechsler-Bellevue Scale IQ’s are converted, 
or standard, scores in which the mean is always 100 for each age 




group and the SD is 15. Wechsler-Bellevue IQ’s have the same 
meaning from one age to another in the sense that an IQ of 105 
or of 86 implies the same relative superiority or inferiority to the 
examinee’s age group. The Wechsler-Bellevue IQ is a standard 
score, whereas the Stanford-Binet IQ is a ratio, MA/CA. The 
two indices are highly correlated, but are not equivalent. To 
avoid confusion, it helps to write “Wechsler-Bellevue IQ” when- 
ever the deviation IQ is meant. Both the Wechsler-Bellevue IQ 
and the Stanford-Binet IQ are measures of abstract intelligence 
(page 46). 
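
The deviation IQ summarized above is a standard score rescaled to a mean of 100 and an SD of 15 within each age group. The sketch below shows only that conversion. In practice the IQ is read from tables in the Manual; the age-group mean of 75 is taken from the text's age-60 example, while the age-group SD of 25 is a hypothetical value chosen purely for illustration.

```python
def deviation_iq(total_score, group_mean, group_sd):
    """Standard-score IQ: 100 + 15 * (distance from the age-group mean in SD's)."""
    z = (total_score - group_mean) / group_sd
    return round(100 + 15 * z)

AGE_60_MEAN = 75          # from the text: a score of 75 is "at age" for age 60
HYPOTHETICAL_SD = 25      # assumption for illustration, not a Manual value

at_mean = deviation_iq(75, AGE_60_MEAN, HYPOTHETICAL_SD)
one_sd_above = deviation_iq(100, AGE_60_MEAN, HYPOTHETICAL_SD)
one_sd_below = deviation_iq(50, AGE_60_MEAN, HYPOTHETICAL_SD)

print(at_mean, one_sd_above, one_sd_below)
```

Because the mean and SD are taken separately for each age group, the same total score yields different IQ's at different ages, which is the pattern in the Manual examples quoted in the text.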


THE WECHSLER-BELLEVUE SCALE 
IN THE SCHOOLS 


The Wechsler-Bellevue Scale has been widely used in the
individual study of adolescents and older students for whom the 
content of the Stanford-Binet is inappropriate. The test is most 
valuable, therefore, to teachers in the upper grades and in high 
schools and technical schools. 


Range and Stability of Wechsler-Bellevue IQ’s

The range of IQ’s in the general school population is about the
same for the Wechsler-Bellevue Scale as for the Stanford-Binet.
Table 3-3 will serve, therefore, as a guide in the interpretation of
a test score. Table 3-3 may be taken also as providing a statement
of the educational expectations of older students when we know
the Wechsler-Bellevue IQ. The reliability of the Wechsler-Bellevue
Scale, as given by its standard error, is approximately
five points. Hence, the IQ from this scale has about the same
stability as the Stanford-Binet IQ.


The Wechsler-Bellevue Scale in Diagnosis 


The Full Scale—like the Stanford-Binet—yields a measure of 
a student’s general mental level and is often used to provide this 
information. The Wechsler-Bellevue, however, has also been 




widely employed in mental hospitals and clinics for the diagnosis 
of abnormal behavior. The Scale has been useful in the study of 
variations of performance in schizophrenia and other mental dis- 
eases, in senile deterioration, and in assessing the effects of brain 
damage and the results of brain surgery. The fact that there are 
eleven separate tests (six Verbal and five Performance) in the 
Full Scale has led clinical psychologists to attempt to discover the 
relative efficiency of various mental functions from irregularities 
in test performance. 

The diagnosis of differential abilities (strengths and weak- 
nesses) from the sub-tests of the Wechsler-Bellevue must always 
be taken as tentative, though an examination of the different sub- 
tests may provide valuable clues. The various tests of the Scale 
are too short and too complex (in that they test overlapping 
abilities) to allow a sweeping judgment to the effect that “Bill 
has poor planning capacity and poor judgment” or that “Mary 
has a good memory and adequate concentration.” Observations 
of this sort are valuable only if made cautiously and taken in con- 
junction with other evidence. The Full Scale is a good index of 
present mental efficiency, and the difference between the Verbal 
and Performance IQ’s may be significant of the academic vs. the 
non-academic “mind” (page 75). But judgments drawn from 
specific sub-tests with respect to strengths and weaknesses in
memory, learning, perception, planning capacity, concentration, 


emotional blocks, and the like must be taken as suggestive rather 
than conclusive. 


WECHSLER INTELLIGENCE SCALE 
FOR CHILDREN 
Description. The WISC, as it is called, is a downward revision
of the older Wechsler-Bellevue to render the test more suitable
for young children. There are ten sub-tests and two alternates
(twelve in all) in the WISC. The sub-tests have the same form
and cover the same content as the Wechsler-Bellevue, except
that easier items have been added. Tests are grouped into five
Verbal and five Performance as follows:


Verbal Scale                  Performance Scale
General Information           Picture Completion
General Comprehension         Picture Arrangement
Arithmetic                    Block Design
Similarities                  Object Assembly
Vocabulary (Digit Span)       Coding (or Mazes)


The Wechsler Intelligence Scale for Children differs in several 
respects from the Wechsler-Bellevue. In the Verbal Scale, 
Digit-Span proved to be less satisfactory than the other tests and 
hence became an alternate, Vocabulary being substituted. In 
the Performance Scale, Coding is a somewhat easier version of the
Digit-Symbol test. Mazes are sometimes given instead of Coding,
but Coding is usually preferred, since it takes less time
to administer. The maze test is the only test not found in the
Wechsler-Bellevue.


Scope. The WISC is a better made test than the Wechsler-Bellevue.
To provide norms, one hundred boys and one hundred
girls were tested at each age level from 5 to 15. Children in the
standardization sample were drawn from eleven states and from
three institutions for the feebleminded. The sample was carefully
checked to give a cross section of geographic areas, urban-rural
groups, and occupational levels of parents.


Scoring. As was true of the Wechsler-Bellevue, all sub-tests
were first converted into standard scores in a distribution with
M = 10 and SD = 3. Tables are provided for reading scale
score equivalents to raw scores for each 4-month period from 5
to 15 years. These equally weighted sub-test scores are added
and then again converted into “deviation IQ’s,” with Mean =
100 and SD = 15 (page 39). Verbal, Performance, and Full
Scale IQ’s may be read from appropriate tables in the Manual.
Approximately 50 per cent of school children can be expected
to earn WISC IQ’s between 90 and 110.
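
The two conversions described under Scoring are both linear
standard-score transformations. The following is a minimal sketch
in Python, assuming a hypothetical raw-score mean and SD for one
sub-test at one age level. The actual values come from the tables
in the Manual and are not reproduced here, and the Manual's tables,
not a formula, are what the examiner uses in practice.

```python
def standard_score(raw, raw_mean, raw_sd, new_mean, new_sd):
    """Re-express a raw score on a scale with the chosen mean and SD."""
    z = (raw - raw_mean) / raw_sd      # distance from the mean in SD units
    return new_mean + new_sd * z


def deviation_iq(z):
    """Deviation IQ for a given z score (Mean = 100, SD = 15)."""
    return 100 + 15 * z


# Hypothetical norms for one sub-test at one age level (illustration only).
RAW_MEAN, RAW_SD = 20.0, 4.0

# A raw score one SD above the mean, placed on the M = 10, SD = 3 scale.
scaled = standard_score(24, RAW_MEAN, RAW_SD, new_mean=10, new_sd=3)
print(scaled)              # 13.0

# The same relative standing expressed as a deviation IQ.
print(deviation_iq(1.0))   # 115.0
```

The point of the sketch is only that a scaled score of 13 and an
IQ of 115 describe the same standing: one SD above the mean of the
child's own age group.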


Differences Between the WISC and the Stanford-Binet. The 
WISC differs from the Stanford-Binet in several important re- 
spects. First, all items of a given sort in the WISC are organized 
into sub-tests instead of different kinds of items being placed at 
successive age levels. WISC is a point scale rather than an age 
scale. Second, the WISC IQ is a deviation IQ—a standard score 
in a distribution with Mean = 100 and SD = 15—whereas the 
Stanford-Binet IQ is a developmental ratio or MA/CA. The two 
IQ’s are closely related (the correlations between the two sorts of
scores run from .80 upward), but they are not identical (page
66). The SD of the Stanford-Binet distribution of IQ’s is 16, as
against the WISC SD of 15; and some of the difference between
the two IQ’s is due to the greater spread of the Stanford-Binet
IQ’s. Furthermore, the two mental examinations differ in length,
variety, and difficulty of items. Finally, the WISC provides for
three IQ’s: a Verbal, a Performance, and a Full Scale. There is
only one IQ from the Stanford-Binet, based upon all of the tests
in the scale.
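
The difference between the two kinds of IQ can be made concrete
with a little arithmetic. The sketch below is an illustration
only: the ages are invented, and treating the Stanford-Binet IQ as
a standard score with SD = 16 is a simplifying assumption made
here to show the effect of the larger spread.

```python
def ratio_iq(ma_months, ca_months):
    """Developmental ratio IQ: mental age over chronological age, times 100."""
    return 100 * ma_months / ca_months


def deviation_iq(z, sd=15):
    """Deviation IQ: a standard score with mean 100 and the given SD."""
    return 100 + sd * z


# A child of 10-0 (120 months) with a mental age of 12-0 (144 months).
print(ratio_iq(144, 120))        # 120.0

# The same relative standing (2 SD above the mean) on each scale.
print(deviation_iq(2, sd=15))    # 130
print(deviation_iq(2, sd=16))    # 132
```

A child who stands equally high in his age group on both
examinations can thus be expected to show a somewhat more extreme
IQ on the scale with the larger SD.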


THE WISC IN THE SCHOOLS 
Both the WISC and the [...] Performance Scale IQ.
Bright children tend to score higher on the Stanford-Binet than
on the WISC, whereas dull pupils score higher on the WISC.




The two IQ’s of the WISC (Verbal and Performance) are
useful in bringing out differences in verbal and manipulative
abilities. Sometimes a child (usually a boy) will do better on
the performance tests of the WISC than on the verbal, indicating,
perhaps, greater aptitude for vocational than for academic
subjects. A bookish youngster who reads a great deal may do much
better on the verbal tests. The performance IQ is usually higher
than the verbal in severely disturbed adolescents, and this
difference appears also in younger dull students. From the
way in which the child handles the verbal tests, the expert
examiner will often note evidences of insecurity as revealed by
[...], verbosity, poor attention, and defeatism. Poor
performance on the manipulative tests often reveals inept planning
and defective co-ordination, whereas good performance
shows concentration and adequate sensory-motor organization.


Range and Stability of the WISC IQ's 


The range of WISC Full Scale IQ’s to be expected in the
general school population, and the meaning of these “scores” are
shown in Table 3-4.


TABLE 3-4
Intelligence Classification for WISC IQ’s

IQ Ranges        Classification       Percent in Each Group
130 and above    very superior         2
120 — 129        superior              7
110 — 119        bright normal        16
 90 — 109        average              50
 80 — 89         dull normal          16
 70 — 79         borderline            7
 69 and below    mental defective      2


It will be seen that these classifications correspond closely to those
for Stanford-Binet IQ’s.
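
The percentages in Table 3-4 are very nearly what the normal curve
gives for a distribution with Mean = 100 and SD = 15. As a rough
check, and not as the procedure by which the table was built, the
sketch below computes the share of cases in each band. It assumes
that IQ can be treated as a continuous, normally distributed score
and that the bands are cut at 70, 80, 90, 110, 120, and 130.

```python
from statistics import NormalDist


def band_shares(mean=100, sd=15):
    """Per cent of a normal distribution falling in each IQ band."""
    dist = NormalDist(mu=mean, sigma=sd)
    cuts = [70, 80, 90, 110, 120, 130]
    labels = [
        "mental defective", "borderline", "dull normal", "average",
        "bright normal", "superior", "very superior",
    ]
    # Cumulative proportions at each cut, with 0 and 1 for the open ends.
    cum = [0.0] + [dist.cdf(c) for c in cuts] + [1.0]
    return {
        label: 100 * (high - low)
        for label, low, high in zip(labels, cum, cum[1:])
    }


for label, share in band_shares().items():
    print(f"{label:17s}{round(share):3d}")
```

Rounded to whole numbers, the seven shares agree with the last
column of the table, including the figure of 50 per cent for the
band from 90 to 110.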




Reliability coefficients for the WISC are generally above .90.
They are higher for the Full Scale than for either the Verbal or
Performance Scales. The standard error of a WISC IQ is 4-5
points.
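
The figure of 4-5 points is consistent with the usual formula for
the standard error of an obtained score, SD multiplied by the
square root of (1 - r), where r is the reliability coefficient.
The sketch below applies that formula; the use of this particular
formula, and the two reliabilities tried, are assumptions made
here for illustration rather than figures taken from the Manual.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Standard error of an obtained score: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)


# WISC IQ's have SD = 15; reliabilities are "generally above .90".
print(round(standard_error_of_measurement(15, 0.90), 1))   # 4.7
print(round(standard_error_of_measurement(15, 0.92), 1))   # 4.2
```

In practical terms, an obtained IQ of 100 is better read as
"somewhere around 95 to 105" than as an exact value.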


MA’s from the WISC 


The WISC does not ordinarily make use of mental age, but
when mental ages are required for clinical or for legal reasons,
they can readily be determined. The Manual (Appendix E) provides
a table of “test-age equivalents to WISC raw scores.” By
reference to the table, we find the chronological age of a child
for whom a given raw score is typical (or average), and this is
the MA corresponding to the score. For example, a score of 12
on the Comprehension test is achieved on the average by children
who are 10-6 years old. Hence, a score of 12 in Comprehension
has an equivalent MA of 10-6. The mean of the sub-test MA’s is
computed (Mean Test-Age Method) or the median of the MA’s
(Median Test-Age Method). Either of these determinations gives
the final over-all MA. A closely equivalent method for determining
MA’s from the WISC is by use of the formula MA = (IQ/100) X
CA. A child who achieves an IQ of 110 and who is 8-2 years old
(98 months) has an MA of 1.10 X 98, or about 108 months, that
is, approximately 9-0 years.
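
The arithmetic is easier to follow when ages are carried in
months. The sketch below works the example from the text and then
shows the Mean and Median Test-Age Methods on a set of sub-test
test ages. Only the Comprehension value (10-6, or 126 months)
comes from the text; the other four test ages are invented for
illustration.

```python
from statistics import mean, median


def to_months(years, months):
    """Convert an age written as years-months (e.g. 8-2) to months."""
    return 12 * years + months


def to_years_months(total_months):
    """Convert months back to a (years, months) pair, rounded to a month."""
    return divmod(round(total_months), 12)


def ma_from_iq(iq, ca_months):
    """MA = (IQ / 100) x CA, with both ages in months."""
    return iq / 100 * ca_months


# Example from the text: IQ 110, CA 8-2.
ca = to_months(8, 2)                           # 98 months
print(to_years_months(ma_from_iq(110, ca)))    # (9, 0)

# Sub-test test ages in months; 126 (10-6) is the Comprehension example,
# the remaining values are hypothetical.
test_ages = [126, 118, 134, 110, 130]
print(to_years_months(mean(test_ages)))        # (10, 4)
print(to_years_months(median(test_ages)))      # (10, 6)
```

The mean and the median will usually lie close together, as they
do here; the median is less affected by one unusually high or low
sub-test.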


PERFORMANCE TESTS 
Development of Performance Tests 


Performance tests designed to measure general mental ability 
have been often used in the schools (1) as substitutes for the 
more verbal tests, and (2) as supplements to the Stanford-Binet 
and other linguistic scales. Performance and non-language tests 
must of necessity be employed with pre-school children and with 
the very dull. Such tests are useful additions to the Stanford- 
Binet or WISC in the mental examination of children with speech 
and language defects or children with visual and auditory impairment.
Batteries of performance tests have long been used in
psychological clinics and in institutions for the feebleminded.
The classroom teacher should know about performance tests,
though he will encounter them much less often than the WISC
or the Stanford-Binet.

The Pintner-Paterson Scale of Performance Tests (1917) was 
the first organized battery of manipulative and non-language 
tests. Widely used for many years, this scale has now been re- 
placed to a considerable degree by other batteries, based upon 
it. The Pintner-Paterson Scale consists of fifteen separate tests. 
The ten tests most often used (in what is called the Shorter 
Scale) include four form boards, three picture completion tests 
(of the jigsaw puzzle type), two object assembly tests, and 
one block-counting test.

Later performance scales are the Cornell-Coxe Performance
Ability Scale (1934) and the Arthur Point Scale of Performance
Tests (1930, revised 1947). These test batteries draw heavily on
the Pintner-Paterson, but include, too, important additions and
revisions. In addition to these test batteries, there are a number
of other performance tests, of which a graduated series of mazes,
the Porteus Mazes, is the best known. Widely used types of
performance tests are the object assembly (page 65), various form
boards, block counting, and block design. Two of these, block
design and object assembly, are found in the Wechsler-Bellevue
Scale.

Norms are generally available for the individual performance
tests, so that one may use one or more tests without having to
administer the whole scale.

The Arthur Scale 
The Arthur Point Scale has been widely used over the age
range of the elementary school. It is made up of performance
tests taken from various sources; it was first published in 1930
and revised later in 1947. The later edition is a considerable
improvement over the original insofar as standardization is
concerned, and the Scale is a good example of a performance battery
designed for children. Figure 3-5 (facing page 55) shows the five
tests in the Arthur Point Scale, which are as follows:


Knox Cube. The four cubes (see Figure 3-5) are tapped in a 
certain order by the examiner, for example, cube 1, cube 4, 
cube 2, cube 3. The child is told to imitate the tapping order. 
Tapping sequences become longer and more complex, until 
they can no longer be done by the child. 

Seguin Form Board. Ten common geometric forms (Figure 3-5) 
are to be fitted into the right apertures in the board. 

Porteus Mazes. The child is told to trace the shortest path from 
the entrance to the exit in a maze, not lifting the pencil from 
the paper. If he makes an error by crossing a line or entering 
a wrong pathway, he is stopped and given a second trial. Mazes 
increase in difficulty from 3-year level to adult. 

Healy Picture Completion II. As shown in Figure 3-5, the test
shows successive scenes in a boy’s life during a typical school 
day. Small pieces or blocks have been cut out of the scene. The 
child must select the appropriate pieces from the box and fit 
them in place. 

Arthur Stencil Design Test. The child must reproduce designs 
of increasing complexity. Standard designs to be copied are 
presented on cards. Each design can be reproduced by fitting 
together stencils in different colors on a solid white card. 
Several stencils are needed for the more detailed designs. 


Scope. The Arthur Performance Scale covers an age range 
from about 4 years to maturity, but is used chiefly with younger 
children. The Scale is employed mainly as a clinical test supple- 
mentary to, or as a substitute for, the Stanford-Binet. 


Scoring. Scores on the sub-tests (based on accuracy and time) 
are first converted into point scores. These are combined and 
converted into mental ages. MA’s are chronological ages which 
are typical for given combined scores. Thus if the average child 
of 10-0 scores 31 points, a score of 31 points becomes an MA of
10-0. The MA is divided by CA to give a “performance IQ.” These
are not equivalent to the IQ’s from verbal intelligence
scales and are not to be so regarded. Arthur Scale IQ’s should
always be described as “Arthur Scale IQ’s.”


Performance Tests in the Schools 


Correlations between scores on the Arthur Scale and the
Stanford-Binet are fairly high (.50 or more). The two tests are,
however, not measuring exactly the same functions, and hence
the Arthur IQ is often used as a “performance supplement” to
the Stanford-Binet IQ. Arthur IQ’s are higher than Stanford-Binet
IQ’s when the latter are low, that is, below 90; and this
discrepancy is especially striking when children are very dull.
There is evidence that low performance test scores may be
indicative of behavior problems and of emotional instability. This
result probably grows out of the disturbed child’s poor attention
span, poor perception of relations, and ineptitude in manual
activities. Emotional involvement may take expression in bizarre and
unusual responses.

For the classroom teacher the main value of a performance
test lies, perhaps, in the fact that such tests (1) may reflect poor
language development or lack of language training, and are (2)
often indicative of cultural and educational handicaps. As
pointed out on page 68, a comparison with verbal tests often
reveals, for instance, children whose manual and manipulative
skills (“concrete intelligence”) run ahead of their verbal facility
(“abstract intelligence”). Performance tests serve, too, to identify
the shy and inarticulate child who is brighter than the verbal
tests show. Performance tests are not especially useful with
normal school children over 12 years of age and they rarely
differentiate significantly among older bright children.

Case Histories. The following brief case histories will illustrate
how performance tests, when used together with verbal tests,
may provide a better understanding of a pupil’s capabilities.
Recommendations in most cases must be tentative and subject to
possible revision in the light of further information.


Case I. Donald B.: age, 10-2; Stanford-Binet IQ, 92; Arthur
Scale IQ, 106.
Donald is doing poor work in the fifth grade. His father is a
barber, his mother a clerk (part-time) in a store; neither parent
went beyond the seventh grade. There are three other children
in the family, all younger than Donald. There are few books in
the home, but the family owns a TV set and a new automobile.
Donald reads the sports page and the comics in the daily
newspaper, but little else. He talks in brief sentences and is generally
unresponsive in school. He is a well-grown boy for his age, a
good athlete, and is well accepted by his classmates. He has
never been a behavior problem.
Recommendation: Donald’s performance IQ is fourteen points 
higher than his Stanford-Binet IQ. In view of his relatively 
meager abstract intelligence, this boy is probably doing as well 
as we can expect. He may get to high school, but will almost 
certainly not complete more than one year. Vocational training 
seems to be indicated. He will continue to have trouble with 
verbal subjects, but may be very successful at a skilled trade. 


Case II. Joan M.: age, 8-3; Stanford-Binet IQ, 126; Arthur Scale 
IQ, 109. 

Joan is doing excellent work in the fourth grade. Her problem 
is social rather than scholastic. Her father is dead, and her mother, 
a widow, is a successful dress designer. Joan is an only child and 
is alone much of the time. She reads a great deal but has few close 
friends and is often left out of class activities. She has a tendency 
to daydream and is shy and withdrawn.

Recommendation: Joan’s low performance IQ, coupled with
her high Stanford-Binet IQ, indicates a lack of experience with
“concrete” activities, such as running, playing out-of-doors,
skipping rope, dancing, and the like. This lack of opportunity to
develop manual skills is often found in children reared in a large
city. Joan’s mother may be encouraged to meet with other mothers
in the class, arrange parties, and invite Joan’s classmates to
her home. The help of the physical education teacher in getting
Joan into games should also be sought. The classroom teacher
should see to it, by suggestion and indirection, that Joan is
included in class parties and out-of-class activities.


Case III. Bob W.: age, 9-2; Stanford-Binet IQ, 104; Arthur Scale
IQ, [...].
Bob is doing satisfactory work in the fourth grade. He is shy,
with a slight tendency to stammer, especially when
[...]. He is one of four children, the other three being girls
[...] than he. Bob’s father is a successful lawyer, and his
mother is a college graduate and a prominent club woman. The
parents have decided that Bob, as the only boy, is to be a
professional man, preferably a physician (his grandfather was a
well-known surgeon). They are dissatisfied with Bob’s marks, and
are sure he is intelligent and that the teacher is to blame.
Recommendation: Bob is clearly a normal boy. He is not
bright, though he probably is brighter than his Stanford-Binet
IQ indicates. The parents must somehow be reconciled to the
fact that (1) Bob is not of professional caliber, and (2) a lower
vocational goal (one within Bob’s intellectual grasp) will make
for a happier boy and probably a much happier life. They must
be urged not to scold the boy and thus make him feel more
inferior than he already does. This is a difficult problem, because
it is the parents, not the child, who have to be “sold” on a
different program from the one they have planned.

SUGGESTIONS FOR FURTHER READING 


General: 


Anastasi, A. Psychological Testing. New York: Macmillan, 1954.

Cronbach, L. J. Essentials of Psychological Testing. New York: Harper.

Freeman, F. S. Theory and Practice of Psychological Testing (Rev.
Edition). New York: Holt, 1955.




Specific: 


Arthur, G. A. A Point Scale of Performance Tests. Revised Form I.
Manual for Administering and Scoring the Tests. New York:
Psychological Corp., 1947.

McNemar, Q. The Revision of the Stanford-Binet Scale: An Analysis
of the Standardization Data. Boston: Houghton Mifflin, 1942.

Terman, L. M., and Merrill, M. A. Measuring Intelligence. Boston:
Houghton Mifflin, 1937.

Wechsler, D. The Measurement of Adult Intelligence (3rd edition).
Baltimore: Williams and Wilkins, 1944.

Wechsler, D. Wechsler Intelligence Scale for Children. Manual. New 
York: Psychological Corp., 1949. 


SUGGESTIONS FOR LABORATORY WORK 


1. Examine the Stanford-Binet items at ages 4, 8, 12, and [...]
Adult. Classify the items at each age level as verbal, numerical,
perceptual (for example, mazes and the like), and performance
(manipulative). Add other categories if you need them. Which
category has the largest number of items?

2. Have members of the class pair off and test each other. Be sure to
follow the Manual carefully. Results from this “test” will not be
indicative of mental ability, to be sure, but following the procedure
is a good way to learn about the test.

3. Repeat (1) and (2) for the Wechsler Intelligence Scale for Children.
For (1), sample the items of each test.

4. Go over the Manual of the Arthur Point Scale. If materials are
available, administer the Scale to a child before the class.


QUESTIONS FOR DISCUSSION 


1. What importance do you attach to the fact that test items in the
Stanford-Binet become more “verbal” as we go up the age scale?

2. Which test, Stanford-Binet or Arthur Point Scale, would you expect
to prove more effective in the following situations:

a) selecting children for a special class for the gifted

b) selecting children for remedial work in a “slow” class

c) studying children with reading problems

d) testing children with speech defects

3. A child taken from public school and entered in a private school
is reported by his mother to have shown an increase in IQ of 20 points
after six months in the “new” school. Assuming the story to be true,
what is misleading about it? What might account for the change in the
IQ?

4. A high school boy of 16 has a Wechsler-Bellevue IQ of 132. What
advice would be justified by this fact alone?

5. Look over the items in the WISC. Which do you think depend
primarily on schooling? Do the same for the Stanford-Binet. Which test
is the more “school centered”?

6. Terman states that the vocabulary test gives the closest
approximation to total performance on Stanford-Binet. What does this
tell us about the nature of the Stanford-Binet IQ?

7. In deploring the reading interests, TV programs, and voting habits
of the American adult, critics have said that the average mental age of
the adult is about 14 (sometimes this is 12 or 15). What does mental age
signify here, if anything definable?

8. Does a child with an IQ of 80 possess 80 per cent of normal
intelligence? Explain your answer.

CHAPTER 4 


GROUP TESTS OF INTELLIGENCE


Group and Individual Tests of Intelligence 


Group tests of intelligence are much like individual tests except 
that (1) they are administered like school examinations, and (2) 
they are objective in form—are answered by checking or circling 
a number or letter, or by marking one of several possible re- 
sponses. Group tests contain both verbal and non-verbal ma- 
terials. Items of the first sort are expressed in words and numbers; 
non-verbal test items, on the other hand, consist of problems 
presented in pictures and diagrams. There is a minimum of 
language and little or no reading required in non-verbal items. 
Intelligence tests for pre-school and first-grade pupils are of
necessity non-verbal, though directions are given orally.
Intelligence tests in the elementary grades contain both verbal and
non-verbal items. At the high school and college levels, test items are
mostly verbal, mathematical, and abstract, but even here many
problems are presented in pictorial and spatial forms.

Group tests of intelligence confront the examinee with tasks
like those found in the individual intelligence scales. Both types
of test minimize routine school learning and emphasize mental
alertness by presenting problems which demand reasoning,
generalization, and the manipulation of “ideas.” But there are
differences, too, between the two sorts of test. In individual
intelligence scales, questions are stated orally and are answered orally;
moreover, problems are presented one at a time without time
limit, or a generous limit is allowed. In group intelligence tests,
questions are printed in a booklet, time limits are fixed, and
answers are limited to the options provided. The group test is
more dependent on reading than is the individual test, it is less
flexible in response, and it is often disturbing to children who are
easily flustered by a time limit. When a child’s school work
and/or the teacher’s opinion of his abilities do not jibe with his
group test score, it may be advisable to check the group test
result against the Stanford-Binet. Group tests, like individual
scales, are concerned almost entirely with the abstract level of
intelligence (page 46).

The first group tests to be widely used were the two intelligence
examinations developed for use in the army during World
War I (1917-1918). Army Alpha consisted of eight sub-tests:
Following Directions, Arithmetic Problems, Best Answers,
Disarranged Sentences, Same-Opposites, Number Series Completion,
Analogies, and Information. Army Beta made use of diagrams
and pictures, and directions were given in pantomime. During
World War II, the Army General Classification Test (AGCT)
was developed as a measure of general ability. Unlike Alpha,
the items in AGCT were not grouped into sub-tests, but were
printed in ascending order of difficulty. A civilian edition of
AGCT is now available.




REPRESENTATIVE GROUP TESTS OF 
INTELLIGENCE 


This section will describe several tests of general intelligence 
covering the age range from pre-school to college. These test 
batteries have been chosen for illustration because they are well 
standardized, are widely used in the schools, and are representa- 
tive of a large assortment of group tests designed to measure 
general ability. They are not necessarily the best mental examina- 
tions for every testing situation nor for every school. Selection 
of a “best” test will depend on the objectives which the school 
hopes to achieve, the time available for testing, and the money 
and personnel which the school has available. 


GROUP TESTS OF INTELLIGENCE 


1. Pintner-Cunningham Primary Test
2. California Test of Mental Maturity
3. Otis Quick-Scoring Mental Ability Tests
4. Kuhlmann-Anderson Intelligence Tests
5. Terman-McNemar Test of Mental Ability
6. American Council on Education Psychological Examination


1. The Pintner-Cunningham Primary Test* 


Description. This test includes seven non-verbal sub-tests
described as follows:

1. Common Observation: Child marks all of the objects in a
given set which fit into some category. (See Figure 4-1, row 1.)

2. Aesthetic Differences: The child is told to mark the
“prettiest” (that is, best) of three drawings of the same object.
(Figure 4-1, row 2.)

3. Associated Objects: The child marks the two objects that
belong together in each row of pictures, as, for example, the
hat and the coat. (Figure 4-1, row 3.)

4. Discrimination of Size: The pupil is instructed to mark the
items of clothing which are of the right size for the individual
pictured. For each article of clothing (shoes, hat, gloves, etc.)
one is too large, one is too small, and one is of the right size.

5. Picture Parts: In this test a series of pictures of increasing
complexity is shown. These contain children, toys, animals, and
other items. The same items are shown outside the “standard”
picture, mixed in with other objects. The child is instructed to
mark all of the objects in this group which appeared in the
picture.

6. Picture Completion: In each incomplete picture, the pupil
is asked to locate and mark the correct missing part from among
several parts shown.

7. Dot Drawing: The child is to copy drawings which are
formed by joining dots. See Figure 4-1, row 4.

All the tests are non-verbal, since most of the children for
whom the test is intended have not learned to read. Directions
are given orally.

* Published by the World Book Company, Yonkers-on-Hudson, N. Y.


FIGURE 4-1 Illustrative Items from the Pintner-Cunningham
Primary Test

Test 1. Mark the things that Mother uses when she sews her apron.

Test 3. Mark the two things that belong together.

Test 7. Look at each picture. See how it is drawn. Make another one
like it in the dots.

Reproduced by permission of the World Book Company.


Scope. The Pintner-Cunningham Primary Test covers the 
kindergarten, Grade I, and the first half of Grade II. There are 
three equivalent forms, A, B, and C. 


Scoring. Scores from the seven sub-tests are combined to give
a total point score. Mental ages corresponding to point scores
may be read from tables in the Manual. Pintner-Cunningham
MA’s are chronological ages for which the given point scores
are typical (see page 33). These MA’s are divided by the child’s
CA to obtain an IQ. An alternate (and better) procedure is to
convert the point scores into deviation IQ’s, following the
method used in the Wechsler-Bellevue. The mean IQ is, of
course, 100 and the SD is 16, equal to that of the Stanford-Binet.
Pintner-Cunningham IQ’s are not strictly equivalent to
Stanford-Binet IQ’s, though the same abilities appear to be
measured by the two scales. Correlations between the two test
batteries run from .70 to .90 for kindergarten and primary school
children. This indicates that Pintner-Cunningham is a valid
measure of the abstract intelligence measured by Stanford-Binet. The
reliability or stability of Pintner-Cunningham scores is high, as
shown by the close correspondence of one form with another.


2. The California Test of Mental Maturity (CTMM)* 
Description. These tests contain both verbal and non-verbal
materials. Sub-tests are grouped under the following five heads:
memory, spatial relations, logical reasoning, numerical reasoning,
and verbal concepts. Each of these categories is represented by
from two to four tests. The profile in Figure 4-2 gives the names
and classification of these sub-tests. The first three tests in each
California battery are designed to measure visual acuity, auditory
acuity, and motor co-ordination. These tests, which are in a
separate booklet, are rough screening devices intended to
identify children too handicapped to be correctly classified by
the test battery.

* Published by the California Test Bureau, Los Angeles, Calif.


FIGURE 4-2 Profile for the California Test of Mental Maturity

[Profile chart: the pupil's scores on the sub-tests, grouped under
the five factors, are plotted against a mental-age scale running
from 84 to 204 months, with further rows for total mental factors,
language factors, non-language factors, chronological age, and
intelligence grade placement, and a summary of MA, CA, and IQ.]

Reproduced by permission of the California Test Bureau.


Scope. The California test series covers the range from
kindergarten to college. The five batteries are as follows:

1. Pre-primary level: kindergarten and first grade
2. Primary level: grades 1-3
3. Elementary level: grades 4-8
4. Junior high level: grades 7-9
5. Advanced level: grades 10-college and adult.

These test batteries require about 1½ hours working time. They
are relatively easy to administer and to score.


Scoring. Separate scores are obtained for each of the five areas
(called “factors”) into which the sub-tests have been grouped.
There are also scores (and mental ages) based on (1) the
language or verbal tests alone, (2) the non-language tests alone, and
(3) the test as a whole. From these three scores, separate MA’s
may be read from tables in the Manual. Language, non-language,
and total-test IQ’s are found by dividing the appropriate MA
by the child’s CA. Percentile ranks are also provided for each of
the five “mental factors.” These PR’s may also be read from
appropriate tables.

A special feature of the CTMM is the use of a profile or chart
as an aid in analysis and diagnosis. As shown in Figure 4-2, the
highs and lows of a pupil’s performance in the five areas may be
readily seen from their positions on the profile. Along the right-hand
margin of the chart, percentile ranks (PR’s) are entered
for each factor, as well as for total score and for the language
and non-language parts of the test. These PR’s give the child’s
standing on a scale of one hundred points (page 33). If the PR
is 50, the child stands just in the middle of his age group; if his
PR is 70, then 70 per cent of his age group fall below him in
the given test.
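
The reading of a PR given above (the per cent of the age group falling below the pupil) can be illustrated with a short sketch. The function name and the ten scores are hypothetical. Published PR's are read from norm tables, not computed from a single class, so this only shows the arithmetic behind the idea.

```python
from typing import Sequence


def percentile_rank(score: float, group_scores: Sequence[float]) -> float:
    """Per cent of the comparison group whose scores fall below `score`.

    Assumption: tied scores are not counted as falling below.
    """
    if not group_scores:
        raise ValueError("comparison group must not be empty")
    below = sum(1 for s in group_scores if s < score)
    return 100 * below / len(group_scores)


# Hypothetical age group of ten pupils and one pupil's score of 31.
age_group = [12, 15, 18, 20, 22, 25, 28, 33, 36, 40]
print(percentile_rank(31, age_group))  # prints 70.0
```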




The validity of the CTMM was determined through its correlations
with Stanford-Binet and other standard mental tests.
The tests appear to be very homogeneous (to measure the same
abilities) over the age range from pre-school to college. The
reliability of the language, non-language, and total scores is high:
reliability coefficients of these part scores and of the factor scores
range from .87 to .95 over grades 4-6.


3. Otis Quick-Scoring Mental Ability Tests* 

Description. These tests differ from many group tests of intelligence
in that the test items are not grouped into separate
sub-tests according to type of item. Instead, the different items
(analogies, arithmetic problems, opposites, and the like) are printed
in a continuous repetitive pattern, so that items of a certain sort
(opposites, for example) follow each other at stated intervals.
This arrangement is sometimes called a “scrambled” test, or more
precisely a spiral omnibus arrangement. Items are progressively
more difficult from the start to the finish of the test.

The following items are like those in the Otis Beta Test,** an
examination prepared for grades 4-9.


1. Which of the five things below is soft?
   1. glass   2. stone   3. cotton   4. iron   5. ice

2. A robin is a kind of
   1. plant   2. bird   3. worm   4. fish   5. flower

3. Hat is to head as shoe is to
   1. arm   2. leg   3. foot   4. fit   5. glove

4. North
   1. hot   2. east   3. west   4. down   5. south

5. At five cents each, how many pencils can be bought for 40 cents?
   1. 45   2. 8   3. 200   4. 5   5. 1
* Published by the World Book Company, Yonkers, N. Y.
** The first two items are samples from the Beta Test. Other items are like
those found in the test.




Scope. The Otis Quick-Scoring Tests cover the age range from 
Grade I through college. There are three batteries, as follows: 

Alpha Test (90 items) non-verbal; grades 1-4 

Beta Test (80 items) verbal, numerical, and spatial; grades 4-9 


Gamma Test (80 items) verbal primarily; high school and 
college 


Scoring. The Otis tests are easy to administer, and scoring is 
facilitated by a cutout stencil which can be superimposed on the 
test booklet. The tests are virtually “self-administering.” There 
is a single time limit, which varies from twenty to thirty minutes. 
Mental age equivalents to total score are read from tables in the 
Manual. The Otis IQ’s are deviation scores and are measures 
of brightness. These IQ’s are only generally comparable to
Stanford-Binet IQ’s; the two “scores” are not equivalent. The
reliability of the Otis tests is high. 


4. Kuhlmann-Anderson Intelligence Tests* 


Description. This is a series of thirty-nine separate sub-tests 
grouped into nine overlapping test batteries. The sub-tests include 
verbal and non-verbal materials. The early levels are entirely 
pictorial, but the tests become more verbal and abstract as we
go up the age scale and finally are entirely verbal. Each test 
battery consists of ten sub-tests. 


Scope. Each of the nine batteries is printed in a separate booklet
and is designed to cover one or more grade levels, as follows:

Kindergarten: sub-tests 1-10
Grade 1: sub-tests 4-13
Grade 2: sub-tests 8-17
Grade 3: sub-tests 12-21
Grade 4: sub-tests 15-24
Grade 5: sub-tests 19-28
Grade 6: sub-tests 22-31
Grades 7-8: sub-tests 25-34
Grades 9-12: sub-tests 30-39

Administration of the K-A tests is somewhat more difficult than
with the Otis, since the tests in the batteries are often separately
timed. K-A requires from 30 to 45 minutes to administer.

* Published by the Personnel Press, Inc., 180 Nassau Street, Princeton, N. J.


Scoring. In setting up a scoring plan, the authors of K-A have
employed what is called the “median mental age” method. This
may be described briefly as follows. Each of the ten sub-tests
in a battery yields a mental age. These MA’s (see page 32) are
chronological ages for which a given score is typical or average.
Thus if the children who are 10 years and 2 months old earn
in general a score of 21 on a given sub-test, then the score of 21
corresponds to or is equivalent to a MA of 10-2 on this sub-test.
MA’s are read from tables in the Manual. The median MA
is the median of the ten sub-test MA’s.* This is taken to be the
most representative measure of a child’s over-all ability.

An IQ for the battery is found by dividing the median MA
by the child’s life age, or CA. This IQ is not equivalent to the
Stanford-Binet IQ, though it is related to it. The K-A tests
measure verbal or abstract intelligence primarily, especially at
the upper age levels. The reliability of the K-A, as shown by
the stability of its test scores, is very high. Reliability coefficients
have been computed for grades 1 through 9 separately: these range
from .89 to .97.
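
The “median mental age” method can be sketched in a few lines. The ten sub-test MA's below are invented for illustration; in practice each would be read from the tables in the Manual. The multiplication by 100 and the rounding are assumptions following the usual ratio-IQ convention, and the function names are hypothetical.

```python
from statistics import median
from typing import Sequence


def median_mental_age(subtest_mas_months: Sequence[float]) -> float:
    """Median of the sub-test mental ages, all expressed in months."""
    if not subtest_mas_months:
        raise ValueError("at least one sub-test MA is required")
    return median(subtest_mas_months)


def battery_iq(subtest_mas_months: Sequence[float], ca_months: float) -> int:
    """Median MA divided by CA, scaled by 100 (assumed convention)."""
    if ca_months <= 0:
        raise ValueError("chronological age must be positive")
    return round(100 * median_mental_age(subtest_mas_months) / ca_months)


# Hypothetical MA's (in months) from the ten sub-tests of one battery.
mas = [110, 114, 116, 118, 120, 124, 126, 128, 130, 134]
print(median_mental_age(mas))  # prints 122.0, midway between the 5th and 6th values
print(battery_iq(mas, 120))    # prints 102 for a child aged 10 years 0 months
```

Because the median ignores how far the extreme sub-test MA's lie from the middle, one unusually low or high sub-test does not shift the battery IQ.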


5. Terman-McNemar Test of Mental Ability** 


Description. This test is designed for high school students. It is
a measure largely of ability to read and comprehend fairly difficult
prose. Two numerical sub-tests found in an earlier edition
of the test were eliminated in order to render the test more
unified in content. As it stands, we have a highly verbal battery.


* When ten scores are arranged in order of size, the point (or score) found
by counting off five scores from either end of the series is the typical value or
median (see page 20).
** World Book Company, Yonkers-on-the-Hudson, N. Y.




There are seven sub-tests, described as follows: information, 
synonyms, logical selection, classification, analogies, opposites, 
and best answer. Sample items and instructions for each item 
type are shown in Figure 4-3. These items are easier than are 
the items found in the test proper and are for illustration. Items 
in the test are graded in difficulty from easy to hard. 


FIGURE 4-3 Sample Items from the Terman-McNemar Test
of Mental Ability (Form C)


TEST 1. INFORMATION
Mark the answer space which has the same number as the word that makes the sentence TRUE.
SAMPLE. Our first President was
1 Adams   2 Washington   3 Lincoln   4 Jefferson   5 Monroe


TEST 2. SYNONYMS
Mark the answer space which has the same number as the word which has the SAME or most nearly
the same meaning as the beginning word of each line.
SAMPLE. correct   1 neat   2 fair   3 right   4 poor


TEST 3. LOGICAL SELECTION
Mark the answer space which has the same number as the word which tells what the thing ALWAYS
has or ALWAYS involves.
SAMPLE. A cat always has
1 kittens   2 spots   3 milk   4 mouse   5 hair


TEST 4. CLASSIFICATION
In each line below, four of the words belong together. Pick out the ONE WORD which does not
belong with the others, and mark the answer space bearing its number.
SAMPLES. 1 dog   2 cat   3 horse   4 chicken   5 cow
6 hop   7 run   8 stand   9 skip   10 walk


TEST 5. ANALOGIES
Study the samples carefully.
Ear is to hear as eye is to
1 [illegible]   2 glasses   3 spy   4 wink   5 see
Hat is to head as shoe is to
6 arm   7 leg   8 foot   9 fit   10 glove
DO THEM ALL LIKE THE SAMPLES


TEST 6. OPPOSITES
Mark the answer space which has the same number as the word which is OPPOSITE, or most nearly
opposite, in meaning to the beginning word of each line.
SAMPLE. north   1 hot   2 east   3 west   4 down   5 south


TEST 7. BEST ANSWER
Read each statement and mark the answer space which has the same number as the answer which
you think is BEST.
SAMPLE. We should not put a burning match in the wastebasket because
1 Matches cost money.   2 We might need a match later.
3 It might go out.   4 It might start a fire.


Reproduced by permission of the World Book Company. 




Scope. The Terman-McNemar is planned specifically for 
grades 7 through 12 and for college freshmen. 


Scoring. Total raw score is converted into a scaled score IQ, 
which is closely related to Stanford-Binet IQ. Scores may also 
be expressed as MA’s and as percentile ranks. The working time 
for the test is about 50 minutes. Terman-McNemar comes in 
two equivalent forms. In the construction of the test, a careful 
item analysis was made (page 21+) in order to weed out unsatis- 
factory items. This is offered as evidence of the test’s validity. 
The reliability of the Terman-McNemar is reported to be .96
for a single age level.


6. American Council on Education
Psychological Examination*

Description. This battery of tests is designed to measure scholastic
aptitude, or learning ability in school. It comes in two
forms, one for high-school students and another for college
freshmen. The college test consists of six sub-tests, as follows:

1. Arithmetic problems: 20 problems in multiple-choice form,
of the “mental arithmetic” variety.

2. Completion: 30 items in multiple-choice form. The test
demands word knowledge and definitions.

3. Figure Analogies: 30 multiple-choice items. Analogies involve
geometric forms, areas, angles, spatial arrangements.

4. Same-Opposite: 50 multiple-choice items which demand
vocabulary and word knowledge.

5. Number Series: 30 items to be completed “logically” with
appropriate numbers.

6. Verbal Analogies: 40 items: relation-finding in verbal terms.

In the ACE, college form, sub-tests 1, 3, and 5 are combined to
give a quantitative, or Q, score; sub-tests 2, 4 and 6 are combined
to give a linguistic, or L, score. Each sub-test is separately timed
and each is preceded by a practice exercise. In the high-school
form of the ACE, tests 3 and 6 are dropped, leaving four sub-tests.
Completion and same-opposite are combined to give the L
score, and arithmetic and number series to give the Q score.

* Published by the Educational Testing Service, Princeton, N. J.


Scope. The ACE is the most difficult of the general intelligence
tests described so far. Testing time varies from about forty minutes
(high school form) to sixty minutes (college form).


Scoring. The three scores from the ACE—the quantitative, the 
linguistic, and the total—may be converted into percentile ranks. 
Extensive norms (in PR’s) are published annually covering test 
results from previous years. Q scores have been found to cor- 
relate with achievement (grades) in mathematics and science, 
but the L score has the higher correlation with general achievement
in high school.

The predictive validity (page 31) of the ACE, as determined
over several years, is high. ACE correlates from .40 to .60 with
college grades; its correlations with Stanford-Binet average about
.65. The reliability coefficients of Q, L, and total score are
very high. One feature of the ACE is the publication of norms
for different groups. Separate norms are available for boys and
girls and for three types of college: 4-year, 2-year junior,
and teachers’ colleges. Although the 4-year colleges have
higher mean scores, there is much overlapping of 4-year, 2-year,
and teachers’ college scores. Differential norms are a distinct aid
to educational counselors.


HOW GROUP INTELLIGENCE TESTS ARE USED 
IN THE SCHOOL 
Survey Measures 


In general, the group test of intelligence is used (1) to give
an over-all measure of a child’s abstract ability (often an IQ),
(2) to provide a basis for educational counseling and guidance,
and (3) to give a basis for prognosis. The total score on a group
test is useful to the school administrator, the classroom teacher,
and the parent. Standard tests supply the school administrator
with a systematic record of how different schools, and classes
within a given school, compare in general ability to learn. The
classroom teacher gets needed information concerning the abilities
of individual pupils. Within a given class, the spread of
ability is often disturbingly wide. A teacher can tell from his
test scores whether John and Mary are doing the caliber of
work which can reasonably be expected of them, and whether he
is pitching his instruction at the comprehension level of the class
as a whole. Parents can plan the future education of their children
more intelligently when they know the level of performance
to be expected of them. And students can set their academic
and occupational goals more realistically when they are aware
of their strengths and weaknesses as shown by comparison of
their scores with norms for their age level.


Counseling, Guidance, and Prognosis 

The total score from a group test (the IQ or other type score)
is most useful as a measure of a pupil’s over-all academic ability.
For guidance and counseling the teacher can use to greater advantage
the sub-tests or part scores from the test battery. The
profile of the California Test of Mental Maturity, for instance,
has been especially designed for diagnosis. From the language
and non-language IQ’s, a teacher can judge whether a child is
predominantly “verbal-minded” or “object-minded”; and from
the five “factor” scores on the profile he can judge how proficient
a pupil is in memory, logical reasoning, verbal concepts,
spatial-perceptual relations, and numerical reasoning. In Figure
4-2, for example, low scores in reasoning and vocabulary indicate
poor academic ability; that is, the pupil lacks the ability
to solve problems efficiently by means of symbols (numbers
and words). High scores in these factors reveal good academic
aptitude and, when combined with other traits, suggest that the
pupil is capable of more advanced, and perhaps professional,
training. Low scores in spatial relations reveal little promise of
success in geometry, mechanical drawing, and perhaps manual
training. High scores here, plus high scores in the other factors,
forecast aptitude for engineering and architecture. The memory
factor is based on too meager a sample to provide a reliable
measure of a pupil’s functional memory; the score here might
be significant, however, if very high or very low.

Part scores, like those from the CTMM, are helpful in giving
the classroom teacher clues as to a child’s abilities. But these
scores must always be interpreted with caution (page 68). The
sub-tests upon which such judgments are based are quite short
and are often too narrow to permit of a broad prediction. Large
differences in part scores should always be substantiated by
further investigation; they should jibe with other tests, school
grades, and with the teacher’s judgment from observation of the
pupil’s classroom work.

The Otis Quick-Scoring Mental Ability Tests and the Kuhlmann-Anderson
Intelligence Tests are primarily useful as
measures of the general level of mental functioning. The sub-tests
of the Kuhlmann-Anderson are fairly complex. The authors,
very wisely, do not recommend that specific scores (mental
ages) from sub-tests be interpreted as measuring definite psychological
functions. Wide variations in score from sub-test to sub-test
for a given child may be significant, however, of gaps in
training or in native ability.

The Pintner-Cunningham Primary Test is most useful, perhaps,
in helping the teacher and parent decide whether a child
is mature enough mentally to do first-grade work. Entrance into
first grade should not depend solely upon the MA or CA, however.
Children who are babyish in their social behavior and
poorly developed physically are poor prospects for first grade,
no matter how high their IQ’s.

Because of its high verbal content, the Terman-McNemar
Test of Mental Ability is one of the best predictors of high-school
achievement. The homogeneity of the sub-tests (their
high degree of relatedness) renders the test less useful for diagnosis
of a student’s strengths and weaknesses. The American
Council on Education Psychological Examination is a good predictor
of college work. This battery measures initiative in attacking
new problems, and mental speed and facility and good work
habits, as well as abstract ability. ACE is also useful in guidance,
since it provides three scores: a quantitative (Q), a linguistic
(L), and a total. The ACE for high-school students is used as
a screening test for prospective college freshmen and as a basis
for counselling high-school students who plan to continue their
education beyond high school. The L score is perhaps most
predictive of general college work, because of the great importance
of reading comprehension in college courses. The Q score
has predictive value for science and mathematics, especially when
confirmed by other indicators (grades, teachers’ judgment).

The ACE (page 91) provides separate norms for 2-year,
4-year and teachers’ colleges. The 4-year colleges have the
higher average scores, but variation in score from one type
of college to another is very large, as is also variation in score
within each college type. A student’s chances of entering college
and staying there will depend to a considerable degree upon the
college he chooses (see page 115 for discussion of local and
nation-wide norms) or to which he is admitted. Only superior
students should be encouraged to apply to high-standard colleges,
and not all of these are good risks unless they have the
personal qualities to go along with academic potential. Good
personality and a capacity for hard work may not, in themselves,
enhance a student’s chances of being accepted into an
A-grade college, but they will help him stay there once he is in.
Students with relatively low scores on ACE may be quite successful
in colleges in which the scholastic standards are not too
high. In any event, knowledge of his academic strengths and
weaknesses should be helpful to a student, whether he plans
further school work or not.

Norms for the ACE for high-school students are based upon
selected groups and may be much too high for all high-school
seniors. In fact, his PR on the ACE may be unfair (misleadingly
low) for the high-school senior of modest intellectual endowment
who does not plan to enter college. Such a youngster may
rank well up among 18-year-olds in the population but relatively
low among those of college caliber.


Limitations of the Group Intelligence Test 


Intelligence tests have definite limitations, and teachers and 
parents must not expect the impossible from them. For one 
thing, a group intelligence test cannot increase intelligence, as 
parents sometimes seem to think it should. Again, a group test 
IQ is not necessarily a good measure of a pupil’s drive to accomplish,
or of that dogged determination to stick to an unpleasant
task and see it through. Nor is a fairly high IQ (even a high IQ) 
always accompanied by emotional stability, good judgment, and 
initiative. All these traits are related to good intellect, but the 
relationship is by no means perfect. Many persons of average 
intellectual ability succeed in college, whereas many of greater 
potential fall by the wayside. Intelligence is a necessary, but
is not a sufficient, attribute for high accomplishment in school
or in life.


WHAT TO LOOK FOR IN A GROUP 
INTELLIGENCE TEST 


The adequacy of a group intelligence test is judged by its 
validity, reliability, scoring methods, and norms. The object in 
giving the test, its cost, and such factors as time and personnel 
must also be considered. 


Validity. A test is valid, as we have noted, if it measures what it 
purports to measure (page 30). Group intelligence tests have 
been validated, in general, against various criteria judged to be 
indicative of intellect (page 31). Some of these criteria are 
school grades, ratings for ability, and other intelligence tests. 
All such criteria are admittedly indirect and fallible; at the
same time, they represent measures with which any authentic
test of intelligence must correlate. Perhaps the best criterion of
the validity of a group test is its success in predicting performance
in tasks judged to require intelligence: in school, in business,
in the armed forces, or in a profession. Judged by correlational
criteria and predictive power, most of the widely used
group intelligence tests may be accepted as valid, though never
perfectly so.


Reliability. We have already had occasion to use the term
reliability with reference to individual intelligence tests. If a
child earns an IQ of 108 on one form of a group test and three
months later achieves an IQ of 106, or 108, or even 110 on a
second form (that is, scores within a few points of the first
determination), and if most persons examined show similarly
consistent results, we regard the test as reliable. Reliability depends
essentially upon the stability or consistency of a score.
When properly given and scored, most standard group intelligence
tests are highly reliable.
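
One common way to put a number on this consistency is to correlate the scores the same pupils earn on two forms of a test. The sketch below computes a product-moment (Pearson) correlation by hand. The six pairs of IQ's are invented for illustration, and the choice of this particular coefficient is an assumption here, not a statement of how any one test's published reliability was obtained.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation between two paired sets of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired lists with at least two scores each")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("scores must vary within each list")
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical IQ's of six pupils on Form A and, three months later, Form B.
form_a = [108, 95, 120, 101, 88, 112]
form_b = [106, 97, 118, 104, 90, 110]
print(round(pearson_r(form_a, form_b), 2))
```

A coefficient close to 1.00 means that pupils keep nearly the same relative standing from one form to the other, which is the sense of reliability described above.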


Scoring. Group intelligence tests are first scored in arbitrary
points, one or more points being assigned to each correct answer.
Point or raw scores are frequently converted into MA’s
and IQ’s. Such IQ’s are related to, but are not equivalent to,
Stanford-Binet IQ’s. Group test IQ’s are adequate for screening
and often are satisfactory for guidance; but the individual intelligence
test IQ is a more searching and more nearly constant
measure of a child’s talents (page 61). In addition to MA’s
and IQ’s, many group tests also provide PR’s for raw or obtained
scores. These PR’s are readily interpreted: they show how high
the pupil ranks on a scale of one hundred points. If a high-school
senior has a PR of 85 in the (L) part of the ACE and a PR of
80 on the (Q) part, he should be a good risk for college work.

A second way of rendering the scores from different sub-tests
in a battery comparable is through the use of standard scores.
Point scores may be converted into a standard score scale with
a convenient mean and σ. The sub-tests of a group intelligence
test usually differ in content, in length, and in difficulty. These
part-scores cannot be compared, or combined, as they stand.
But when converted into a common scale, they can be added to
give a total in which each sub-test has the same weight.
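
The conversion to a common scale can be sketched as follows. The target mean of 50 and σ of 10 are assumed here simply as one “convenient” choice. The two sets of raw scores are hypothetical, and the function name is illustrative only.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def to_standard_scores(raw_scores: Sequence[float],
                       new_mean: float = 50.0,
                       new_sd: float = 10.0) -> List[float]:
    """Rescale raw scores so the group has the chosen mean and sigma."""
    group_mean = mean(raw_scores)
    group_sd = pstdev(raw_scores)  # sigma of the group as a whole
    if group_sd == 0:
        raise ValueError("all raw scores are identical; cannot rescale")
    return [new_mean + new_sd * (x - group_mean) / group_sd for x in raw_scores]


# Hypothetical raw scores of five pupils on a long and a short sub-test.
vocabulary = [30, 35, 40, 45, 50]
arithmetic = [4, 10, 8, 6, 12]

vocab_std = to_standard_scores(vocabulary)
arith_std = to_standard_scores(arithmetic)

# Once on the same scale, the sub-tests can be added with equal weight.
totals = [round(v + a, 1) for v, a in zip(vocab_std, arith_std)]
print(totals)
```

Adding the raw scores instead would let the longer vocabulary test, with its wider spread of points, dominate the total.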


Norms. Norms (page 40) are typical measures of achievement.
Norms may be nation-wide or local (page 115). Local
norms are often fairer for a given group in that they take into
account the conditions within a given city or state. National
norms are most useful for wide comparisons and as standards at
which to aim. Norms for college freshmen will generally be
much too high for high-school graduates in general. College
freshmen are a selection of high-school graduates according to
academic proficiency. Group intelligence-test norms are usually
given in terms of age level, but they may be in terms of grade
level.


FIGURE 4-4 Norms for Various Occupational Groups on the
Army General Classification Test

[Bar chart: AGCT standard score (horizontal scale, 60 to 150) by civilian occupation. Each bar marks the 10th, 25th, 75th, and 90th percentiles for one occupation. Occupations, in the order shown from top to bottom: Accountant; Medical student; Teacher; Lawyer; Bookkeeper, general; Stenographer; Reporter; Clerk, general; Purchasing agent; Salesman; Telephone repairman; Artist; Musician, instrumental; Toolmaker; Printer; Machinist; Policeman; Sales clerk; Electrician; Machinist's helper; Welder, combination; Plumber; Carpenter, general; Automobile repairman; Tractor driver; Painter, general; Truck driver, heavy; Cook; Laborer; Barber; Miner; Farm worker; Men in general.]

Reproduced by permission of Harper & Brothers.

Figure 4-4 shows the norms for certain occupations on the 
Army General Classification Test, used in the armed forces in 
World War II. The higher scores are achieved by men with 
the most extensive training, and are probably the resultant of 
both intelligence and training. The more intelligent men are able 
to undertake the more exacting training, and this training enables 
their native talent to express itself. It is interesting to note the 
large degree of overlapping in score from one occupation to 
another. It seems evident that many men are functioning at a 
level below their native capacity. 

The educational expectation of a child whose group test IQ 
is 90, 100, or 115 may be read with sufficient accuracy for most 
purposes from Table 3-3. 

Other Factors Which May Govern the Choice of a Group Intelli- 
gence Test. In addition to the formal requirements to be met by a 
group test discussed above, there are other considerations which 
enter into the suitability of a test for a given school system. 
Among the more important are time available for testing, per- 
sonnel, cost, and acceptability. Catalogues provide data on cost 
and time allowances—most testing periods are set to fit com- 
fortably into a class period. In most cases, teachers can administer 
group tests with a minimum of instruction, and scoring can be 
done with stencils. Acceptability of a test depends on whether 
the teachers and the community look with favor upon standard 
tests. Much of the disfavor with which parents once regarded 
mental tests has fortunately disappeared, though one still encoun- 
ters skepticism as to their value. In initiating a testing program 
it is always wise to avoid tests which contain what appear to be 
trick items and those which resemble puzzles. Such tests are 
likely to be labeled frivolous by teachers and parents. Some 
parents still think that the object of a mental test is to describe
their children as dull or mentally abnormal. When they see the
value of a standard test in providing a better understanding of a 
child’s capabilities, their objections disappear. From the cata- 
logues listed on page 253, the teacher or administrator should be 
able to find the test suitable for a given situation. 


SUGGESTIONS FOR FURTHER READING 


Cronbach, L. J. Essentials of Psychological Testing. New York: 
Harper, 1949. 

Freeman, F. S. Theory and Practice of Psychological Testing (Rev.
edition). New York: Holt, 1955.

Goodenough, F. L. Mental Testing. New York: Rinehart, 1949.

Noll, V. H. Introduction to Educational Measurement. Boston: Hough- 
ton Mifflin, 1957. 

Thorndike, R. L. and Hagen, E. Measurement and Evaluation in 
Psychology and Education. New York: John Wiley, 1955. 


SUGGESTIONS FOR LABORATORY WORK 


1. Administer three or four standard group tests of intelligence to the 
class and have the students score their own papers. If the test is for 
young children, cut the time limits in half. 

2. Select one of the tests taken in (1). Examine the Manual for the 
author’s treatment of validity, reliability, scoring methods, and norms. 
Summarize these data. 

3. In another of the tests from (1), count the number of items which,
in your opinion, are verbal, numerical, and spatial-perceptual. In which
group did you do best? Worst? Does your result jibe with what you 
know about your abilities? 


QUESTIONS FOR DISCUSSION 


1. Is a group test of intelligence anything more than a scholastic 
aptitude test? What else does it add to your knowledge of a pupil? 

2. Why is the score on a reliable intelligence test usually a better 
estimate of a pupil’s ability than is the rating of the teacher? 

3. Why do we get different IQ’s for the same pupil from different 
intelligence tests? 

4. Is the group test of intelligence more useful in an academic than 
in a vocational high school? 


5. Suppose you are a sixth-grade teacher. You have administered a
standard group-intelligence test to your class. What uses do you think
you might make from a knowledge of these children’s IQ’s?

6. A pupil has taken the CTMM (page 84). In counseling this child,
what help might you get from a wide difference in his language and non-language
IQ’s?


CHAPTER 5 


EDUCATIONAL ACHIEVEMENT TESTS 


The purpose of the educational achievement test—like that of 
the ordinary school examination—is to discover how much a 
pupil knows about the subjects he has studied or is studying. 
Both the general intelligence test and the educational achieve- 
ment examination measure aptitude for school work (“abstract 
intelligence”). The difference between the two is one of empha- 
sis rather than of purpose. The intelligence test, as we have 
seen, tries to gauge mental alertness apart from specific school 
knowledge—that is, it is concerned primarily with the efficiency 
of mental processes as exhibited in problems which demand 
learning ability, perceptual keenness, memory, reasoning, and the
like. The educational achievement test is also concerned with
mental processes, but only insofar as they are demonstrated in
a student’s performance in English composition, arithmetic, history,
or science.

The distinction between the two sorts of test is not always 
clean-cut, and there is much overlap in content and in abilities 
called upon. All intelligence tests depend in some degree on 
previous learning, and all educational tests depend in some part on 
native keenness. Educational achievement tests predict future 
school performance as well as or better than intelligence tests. 
Achievement in the elementary school, for example, forecasts 
achievement in high school; and performance in arithmetic pre- 
dicts later performance in algebra. But prediction is strengthened 
when an intelligence test is added to the achievement battery. 
Perhaps the general intelligence test is most useful when we want 
an estimate of potential aptitude, the achievement test when we 
want a measure of present school standing and probable success 
in later school work. Both tests provide valuable information, and 
each supplements the other. 

Educational achievement tests are useful (1) for survey pur- 
poses—that is, to determine a class’s standing in relation to some 
norm, and (2) for guidance and evaluation—that is, to provide a 
clearer understanding of what individual pupils have learned—or 
failed to learn—in specific school subjects. A better understanding 
of strengths and weaknesses is a major objective of a testing pro- 
gram. Remedial work can be undertaken more intelligently and 
teaching improved when we know what errors a pupil is making 
consistently and what misconceptions and gaps in training led to
these errors. 

Achievement tests are often used for sectioning pupils in order 
to improve working conditions within the classroom. Thus pupils 
may be classified into high, average, and low ability groups on the 
basis of over-all educational standing, or sectioned within a 
grade into fast, medium, and slow learners. Prediction of later
school success on the basis of educational achievement tests is
considerably more accurate than are forecasts based on conventional
school marks.


THE SUPERIORITY OF STANDARD ACHIEVEMENT 
TESTS OVER ROUTINE EXAMINATIONS 


Standard achievement tests are superior to teacher-made tests
in three principal respects.


1. The Achievement Test Is Better Planned. The usual teacher- 
made test in algebra or French is composed of questions and 
problems covering topics which one teacher believes worth know- 
ing about his subject. Usually materials are drawn from a single 
textbook. Such a test is valuable as a measure of progress in learning,
but it is not very broad in coverage and does not permit
comparisons with the achievement of students in other schools.

The standard educational achievement examination, on the 
other hand, is compiled after an analysis of many widely used 
textbooks and various courses of study and sets of examinations. 
Thus it represents a consensus—the pooled judgment of many 
competent teachers and testing specialists. Drawing materials 
from many sources insures a representative sampling of subject 
matter. Occasionally a teacher will complain that a general 
achievement test contains questions about topics or books (in 
English literature, for example) which his class has not studied, 
and that on this account the test is unfair. This is often true, but 
the criticism is not as damaging as it may seem. Few classes have 
covered equally well all of the topics treated in a comprehensive 
achievement test. Some teachers will have emphasized one topic, 
some another, but by and large these inequalities will even up for 
the test as a whole. Rarely will a school have a general and marked 
advantage (or disadvantage) over another school in educational 
experience, unless the teaching, the curriculum, and/or the caliber 
of the students are exceptionally good or poor. When gross
inequalities are revealed in test scores, the reason for such
differences should be sought. It seems hardly wise on that account to
abandon the test. 


2. The Achievement Test Is More Objective. The standard 
achievement test is more objective than the teacher-made ex- 
amination. This means that in an achievement test, grades received 
by students depend to a minimum degree on the personal opinions,
likes, and dislikes of the scorer. In the traditional essay
examination, a high degree of subjectivity is almost inevitably 
present: the mark given an answer depends on what one teacher 
regards as important and significant. 


3. The Achievement Test Lays Down More Exact Specifications. 
The educational achievement test is more logically planned than 
the ordinary teacher-made examination, because makers of stand- 
ard tests draw up specifications for an examination. These lists 
are often lengthy and quite specific, but in general they can be 
reduced to two—knowledge and application. Thus test items are 
selected to reveal a pupil’s information and understanding of 
facts, as well as his acquired skills in, for example, reading or
arithmetic. Again, items are chosen to reveal a pupil’s ability to
apply known principles, to interpret, draw conclusions from
given data, and solve problems. The second of these specifications
is the more important, but the first is not to be dismissed lightly
as being a matter of “mere memory.” Students cannot write
good English prose, nor can they read difficult passages in history
and literature, without adequate vocabulary. Even in so “logical”
a subject as mathematics, a student cannot solve “originals” in
geometry (no matter how bright he is) unless he knows the
preceding propositions. Rote memory, of course, is rarely
enough. The older spelling bees found how many detached and
isolated words a child can spell, though often he had little idea
of what the words meant. Modern spelling tests try to discover
whether a child can spell a word and also knows its meaning well
enough to use it correctly in a sentence (that is, in context). The
second method (application as well as knowledge) provides a
better measure of a child’s usable vocabulary.


GENERAL EDUCATIONAL ACHIEVEMENT 
BATTERIES 


The present section will describe five representative achievement
test batteries (chosen from many) which are designed to
measure general educational achievement in the elementary
grades and in the high school.

1. The Stanford Achievement Test (SAT)*
2. The Metropolitan Achievement Tests (MAT)
3. The California Achievement Tests (CAT)
4. The Co-operative General Achievement Tests (GAT)
5. The Sequential Tests of Educational Progress (STEP)

All these tests make some provision for the analytic study of a 
student’s strong and weak points through a comparison of sub-test 
scores. Part scores are often represented comparatively on a 
graph or profile. 


The Stanford Achievement Test (SAT)** 


Description. The SAT consists of overlapping sub-tests 
grouped at four ability levels from grade 2 through grade 9. All 
four of the batteries contain three tests of paragraph meaning, 
word meaning and spelling (these are essentially measures of 
language skills); and two tests of arithmetic reasoning and 
arithmetic computation (number or quantitative skills). All these 
tests are multiple-choice in form. In addition to these five sub- 
tests, the Intermediate Battery (for grades 5 and 6) and the Ad- 
vanced Battery (for grades 7, 8, and 9) include four other tests: 
language, social studies, natural science, and study skills. The 


* These batteries are often referred to in abbreviated form by the capital 
letters. 


** Published by the World Book Company, Yonkers, N. Y. 




Language Test contains items in capitalization, punctuation, and 
sentence structure. The Social Studies Test covers fundamentals 
of history, geography, and civics. The Study Skills Test is an 
ingenious attempt to discover how well a student reads maps, 
interprets graphs and tables, and uses references. This informa- 
tion is important to the teacher, since many pupils regularly skip 
all tables and graphs unless supervised. 

There are five forms for each battery. The Primary Battery
is printed in a single booklet of eight pages and takes a little more
than two hours to administer. Figure 5-1 shows some of the items
from the Primary Battery. The Elementary Battery (for grades 3
and 4) contains six sub-tests: paragraph meaning, word meaning,
spelling, arithmetic reasoning, arithmetic computation, and
language. The Intermediate Battery requires almost four hours and
will, of course, have to be spread over several class periods. The
authors of the tests have drawn up a convenient testing schedule,
with approximate times for each sub-test.


FIGURE 5-1 Sample Items from the Stanford Achievement
Test, Primary Battery, Form K

Test I. Paragraph Meaning.
Directions: “Find the one word that belongs in each space, and draw a line
under the word. Do not write in the spaces.”

Baby pets me.
I drink milk.
I say “Mew, mew.”
I am a ____.

cow kitten pony child

Test IV. Arithmetic Reasoning.
Directions: “Now look at the pictures. Put your finger on the little chair
in the top box. That is right. Next to the little chair are some candles.
Put a cross on the shortest candle. Make a mark like this.” (Illustrate
on the board, making a large X.)

[Pictures of the chair and candles not reproduced.]

“Do you see the row of clocks? Put a big cross on the clock that says it is noon.”

[Row of clocks not reproduced.]

Reproduced by permission of the World Book Company.


Scope. The scope of the SAT is as follows: 


1. Primary Battery: end of grade 1, grade 2, and first half of 
grade 3 

2. Elementary Battery: grades 3 and 4 

3. Intermediate Battery: grades 5 and 6 


4. Advanced Battery: grades 7, 8 and 9 


These four achievement tests cover the fundamentals taught in 
most schools over the elementary grades through grade 9. 


Scoring and Norms. All of the sub-tests are objective in form, 
so that scoring can be readily accomplished by stencils or scor- 
ing keys. Norms are in grade equivalents to raw scores, and also 
in percentiles for sub-test scores. 

There are two types of norms. The first, called the modal-age
grade norm, is recommended for individual diagnosis, that is, 
for evaluating the scores of individual pupils. From tables in the 
Manual, a pupil’s scores can be compared with those earned by 
children who are typical for age and grade. A second norm, the 
total-group grade norm, is based upon the performance of all 
children in a given grade. These norms, given in tables in the 
Manual, are recommended by the authors when one wishes to 
evaluate a class average. Raw scores on the sub-tests are con- 
verted into standard score units so that they may be combined 
and compared (page 38). 
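
As a rough illustration of why raw scores are converted before being combined, the following Python sketch turns raw scores on two sub-tests into standard scores. The scale used here (mean 50, standard deviation 10), the function name, and the sample scores are all assumptions made for the example; the SAT Manual defines its own standard score units.

```python
from statistics import mean, pstdev

def to_standard_scores(raw_scores, new_mean=50.0, new_sd=10.0):
    """Convert raw scores to standard scores on an assumed 50/10 scale."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # standard deviation of the group tested
    if sd == 0:
        raise ValueError("all raw scores are identical; no spread to scale")
    return [new_mean + new_sd * (x - m) / sd for x in raw_scores]

# Hypothetical raw scores for five pupils on two sub-tests of different length.
spelling = [12, 15, 18, 21, 24]
arithmetic = [30, 40, 50, 60, 70]

spelling_std = to_standard_scores(spelling)
arithmetic_std = to_standard_scores(arithmetic)
```

In this made-up case each pupil holds the same relative position on both sub-tests, so the two lists of standard scores agree even though the raw scores are on quite different scales. That is what makes scores from different sub-tests comparable.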

The validity of the SAT is high. The tests possess content 
validity and the correlations of the batteries with grades and
other criteria demonstrate excellent predictive validity. The
reliability of the various batteries is also satisfactory.


The Metropolitan Achievement Tests (MAT)* 


Description. The MAT includes five test batteries with a range 
from grade 1 through the first half of grade 9. All of the test 
batteries contain sub-tests of reading and arithmetic; spelling is 
added after grade 2, and language usage after grade 3. At the
intermediate and advanced levels there are ten sub-tests in all:
reading, vocabulary, arithmetic fundamentals, arithmetic prob- 
lems, English, literature, social studies (history), social studies 
(geography), science, and spelling. In addition to the complete 
batteries, partial test batteries are available for use at the inter- 
mediate and advanced levels. These include the skill subjects— 
reading, arithmetic, English, and spelling—plus vocabulary and 
arithmetic problems. All tests at a given level are printed in a 
single booklet. 

The MAT provides a comprehensive survey of a pupil’s educational
attainment. Moreover, the profile chart (see Figure 5-2)
printed on the last page of the test booklet and the class ability 
sheet allow the teacher to identify the student’s weak points, to 
correct errors consistently made, to study a pupil’s rate of 
progress from time to time, and to group pupils for instruction 
or review. Tests in arithmetic and reading are available as sep- 
arates and may be used when it is not feasible to administer the 
whole battery. 

Scope. MAT includes the following batteries:

1. Primary Battery I: grade 1 and beginning grade 2
2. Primary Battery II: grade 2 and beginning grade 3
3. Elementary Battery: grades 3 and 4 and beginning grade 5
4. Intermediate Battery: grade 5 up to the first half of grade 7
5. Advanced Battery: grade 7 up to the first half of grade 9

MAT covers a wide range of material taught in grades 1-9. Test
batteries require from one hour (primary) to about four hours
(advanced).


* Published by the World Book Company, Yonkers, N.Y.


FIGURE 5-2 Profile Chart for the Metropolitan Achievement
Tests

[Individual Profile Chart, Metropolitan Achievement Tests: Intermediate
Battery, Complete. Columns: the ten sub-tests (Test 2 vocabulary, Test 3
arithmetic fundamentals, Test 4 arithmetic problems, Test 5 English,
Test 6 literature, Test 7 history and civics, Test 8 geography, Test 9
science, Test 10 spelling, following Test 1). Vertical scale: age
equivalents in years and months. Chart not reproduced.]

Reproduced by permission of the World Book Company.


Scoring and Norms. The MAT is easy to administer and to 
score. There are three types of norms: age, grade, and percentile. 
Norms are given also in a standard score scale which is based 
on the assumption of a normal distribution of test ability in the 
sixth grade. Standard or scaled scores are comparable from bat- 
tery to battery in the same subject, but not from test to test within 
a given battery. Figure 5-2 shows the profile of George Ferguson,
who is 11 years and 8 months old. George is a sixth-grade
student, and the MAT was administered on February 6 when 
he was midway through the grade (that is, at 6.5). George’s 
scores on the ten sub-tests have been converted into age- 
equivalents from the appropriate tables in the “Key and Direc- 
tions for Scoring.” His subject ages (also called educational ages 
or EA’s) have been entered on the chart and joined by short 
straight lines to give the profile of his school achievement. A 
straight line drawn horizontally across the chart through George’s 
chronological age of 11-8 shows immediately in what subjects 
he is above or below the scores typical for his age level. 
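
The comparison that the profile makes visually can also be sketched in a few lines of Python. The educational ages below are hypothetical (George's actual sub-test ages are not given in the text); only his chronological age of 11-8 comes from the example above, and the function names are illustrative.

```python
def to_months(age):
    """Convert an age written as 'years-months' (e.g. '11-8') to months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)

def profile(chronological_age, educational_ages):
    """Return each subject's lead (+) or lag (-) in months relative to age."""
    ca = to_months(chronological_age)
    return {subject: to_months(ea) - ca
            for subject, ea in educational_ages.items()}

# Hypothetical educational ages (EA's) for three of the ten sub-tests.
george_eas = {
    "reading": "12-6",
    "arithmetic problems": "11-2",
    "spelling": "11-8",
}

george_profile = profile("11-8", george_eas)
for subject, diff in george_profile.items():
    status = "above" if diff > 0 else "below" if diff < 0 else "at"
    print(f"{subject}: {status} age level ({diff:+d} months)")
```

Working in months avoids the trap of treating an age such as 11-8 as the decimal 11.8.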

George’s raw scores were converted into age- instead of 
grade-equivalents. These EA’s show whether George is acceler- 
ated or retarded as compared with children of his own age. EA’s 
are useful in guidance. Grade equivalents give the grade levels to 
which various scores correspond. A profile plotted from grade- 
equivalents tells us whether a pupil is above or below his present 
grade level in his various subjects. Both norms are useful. Grade
norms are especially useful when comparisons with national or
local norms are to be made; age norms are most useful when
diagnosis of a pupil’s strengths and weaknesses is wanted.

Both the validity and the reliability of the MAT are satisfactory
as judged by the usual criteria.


The California Achievement Tests (CAT)* 
Description. The CAT have been organized into five batteries 


* Published by the California Test Bureau, Los Angeles, Calif. 




designed to cover the ability range from grade 1 to college. The 
tests are survey in nature and are concerned primarily with skills 
in six areas: reading vocabulary, reading comprehension, arith- 
metic reasoning, arithmetic fundamentals, mechanics of English 
and spelling. The authors of CAT believe that tests in these areas 
are more valuable than are tests in such subjects as social studies, 
where the content varies widely from school to school. The 
California Tests emphasize power rather than speed, the time
required for the Elementary Battery being more than two hours.
CAT stresses the use of the separate tests in diagnosis. Except
in the case of spelling, for example, the tests in the six areas are 
subdivided into sections, each dealing with some important aspect 
of the subject. For example, in the Elementary Battery, reading 
comprehension (Test 2) is analyzed into (1) following direc- 
tions, (2) reference skills, and (3) interpretation of material. 
Test 3, arithmetic reasoning, is broken down into (1) meanings, 
(2) signs and symbols, and (3) problems. Scores from each of 
these sub-divisions are plotted on a profile like that of Figure 5-2, 
usually in grade-equivalent units. The analysis of a pupil’s per- 
formance is carried still further by a second grouping together 
of items which presumably measure essentially common func- 
tions. Thus within the division of punctuation under Test 5, 
mechanics of English, items are grouped into those which in- 
volve commas, periods, question marks, quotation marks. Under 
the heading of addition in Test 4, arithmetic fundamentals, items 
are grouped under zeros, carrying, fractions, and decimals. The 
number of item classifications under a given test varies from 50 
to more than 100. A special chart enables the scorer to analyze 
the pupil’s achievement over a wide range of these elements. 
Careful examination of specific item-groups may, to be sure, 
reveal why a pupil fails consistently to use decimals correctly 
or to understand fractions; or it may tell us where he is weak in 
punctuation, or in vocabulary, or in spelling. The CAT at least 
makes an attempt to keep the individual pupil from being lost in
an “average.” At the same time, it must be remembered that
individual diagnosis based on a few items is always tentative and
may be misleading. 


Scope. The CAT consists of the following batteries, which
cover the educational levels described.

1. Lower Primary: grades 1 and 2
2. Upper Primary: grades 3 and 4
3. Elementary: grades 4, 5, and 6
4. Junior High: grades 7, 8, and 9
5. Advanced: grades 9 to 14


Scoring. Raw scores on the tests may be converted by tables 
into age, grade, and percentile-within-grade norms. The sub-tests 
are objective in form, easy to administer, and easy to score. The 
six tests of the batteries have satisfactory reliability, but the re- 
liabilities of the various sub-divisions are quite low because of the 
few items included in some groupings (often only one or two). 


Validity is high for the whole test. 
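
The drop in reliability for very short item-groups can be illustrated with the Spearman-Brown formula, a standard result in test theory that the passage above does not itself cite. The starting reliability of 0.90 and the item counts below are assumed values chosen only for the illustration; they are not figures for the CAT.

```python
def spearman_brown(reliability, length_ratio):
    """Predicted reliability when a test's length is multiplied by length_ratio."""
    return (length_ratio * reliability) / (1 + (length_ratio - 1) * reliability)

# Assumed: a 50-item test with reliability 0.90, cut down to small item-groups.
full_items = 50
full_reliability = 0.90

for group_items in (25, 10, 2):
    ratio = group_items / full_items
    r = spearman_brown(full_reliability, ratio)
    print(f"{group_items:2d} items: predicted reliability {r:.2f}")
```

Under these assumptions the predicted reliability falls steadily as the item-group shrinks, and a two-item group comes out far below the figure for the whole test. This is consistent with the caution that a diagnosis resting on one or two items is tentative.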


Co-operative General Achievement Tests (GAT)* 


Description. These achievement tests deal with three fields or
areas: Test I covers social studies, Test II, natural sciences, and
Test III, mathematics. Each test battery consists of two divisions:
Part I, which deals with fundamental terms, concepts and defini- 
tions; and Part II, which covers applications of knowledge, in- 
terpretation, and comprehension. The battery has been planned 
for grades 10, 11, and 12, but it is probably too difficult for all 
but superior tenth and eleventh graders. The battery is objective 
in form throughout. 


Scope. GAT is a power test designed for the upper school 
grades and for college freshmen. Each test requires from 40 to 
60 minutes. 

Scoring. The tests are all multiple-choice, and are easy to 
administer and to score. Items are graphic, pictorial, and verbal.


* Published by the Educational Testing Service, Princeton, N. J. 




Norms in scaled scores and percentiles are given for high-school 
students and college freshmen. GAT is probably most useful in 
the counseling of high-school students as to the subject fields in
which they show the greatest promise. 


The Sequential Tests of Educational Progress (STEP)* 


Description. As the term “sequential” implies, this battery is
designed to measure a student’s progress in learning as he goes
from the elementary grades to college. The tests deal with
critical skills in seven academic areas: essay tests, planned to
provide standardized tests in writing prose; listening comprehension
tests, in which the examiner reads a passage and asks questions
designed to call out comprehension, interpretation, and
evaluation; reading tests, covering a wide range of content; writing
tests, planned to measure the student’s ability to express ideas;
mathematics tests, which contain items over a wide range of
subject matter and difficulty; science tests, dealing with the
application of scientific knowledge to a variety of situations; and
social studies tests, designed to show progress in social and civic
development.


Scope. STEP is designed to measure achievement over the 
following levels: 


Level 1—freshmen and sophomore years of college 
Level 2—grades 10, 11, and 12 

Level 3—grades 7, 8, and 9 

Level 4—grades 4, 5, and 6 


It should be noted that Level 1 is the highest level academically, 
Level 4 the lowest. STEP attempts to reveal continuity in mental
growth and learning from the bottom to the top level. 


Scoring and Norms. There are two equivalent forms (A and B) 
of each test in STEP except the essay tests, for which there are 
four forms. There are grade and percentile norms. Scoring is by 


* Published by the Educational Testing Service, Princeton, N. J. 




stencils. A profile chart allows the examiner to analyze a pupil’s 
performances on the several functions measured by the battery. 


GENERAL ACHIEVEMENT TESTS 
IN THE SCHOOLS 


We have seen how the general educational achievement test 
gives the academic level of a pupil or of a class, and how the test
profile reveals strengths and weaknesses in a variety of subjects
and processes. Further illustration of how educational achievement
tests may be utilized in (1) evaluation, (2) diagnosis, and
(3) prediction will be given in this section. 


Evaluation. Suppose that Miss Clark has given the SAT to her
sixth-grade class of twenty-six pupils. She finds her class mean 
(average) on the test battery to be about equal to the local norms 
for the sixth grade, but slightly below the national norm as given 
in the Manual. Does this result mean that Miss Clark is doing a 
poor job because local norms are less valuable than national? The 
answer is No, since a number of factors affect achievement in a 
given school system or a single school, and some of these may 
cause local norms to be lower or higher than national. Among
these factors are the following:


1. Retardation as a consequence of strict promotional stand- 
ards and practices. Much retardation will lower local norms, 
whereas the weeding out of poor students (by transfer to 
special classes, for example) will raise local norms. 

2. Promotion by age irrespective of achievement. This fairly
common practice will lead to a progressive lowering of
local grade norms.

3. Previous experience of pupils with standard objective tests.
This factor varies widely and often affects local norms.

4. Coaching in the tests themselves. Sometimes teachers coach 
pupils in materials akin to or identical with those found in 
the tests. “Teaching for the tests” is bad practice and should
be discouraged whenever possible. Coached pupils usually
raise the class’s performance. 

5. Selection. Children from a poor socio-economic back- 
ground generally score lower on standard tests, whereas 
children from good neighborhoods score higher, especially 
on the verbal tests. 

6. Motivation. Children do not try hard on tests if the teacher’s
attitude is negative, or if the parents think achievement
tests are worthless and say so loudly and often.

7. Transfers, drop-outs. These children may affect local
norms, usually adversely. 


In some private schools in which pupils are generally of high 
caliber because of stringent selection procedures, local norms will 
often be found to be considerably above national norms based 
on public school results. In a large city system we can expect an 
occasional sixth-grade class to fall below national norms even 
when the city as a whole is up to national standards. But when 
a number of classes fall below national standards, the curriculum, 
the teaching methods, the promotional standards and other con- 
ditions in the school and the community should be examined. 


Diagnosis. In looking over her test results for the sixth grade, 
Miss Clark may find that Harry is far below the sixth-grade norm 
in reading and that Sue is below the norm in arithmetic. At the 
same time, Mary reads at eighth-grade level, and John (the 
youngest child in the class) is up to the ninth-grade norm in
science. Individual differences like these are the rule rather than 
the exception in most elementary classes. It is fairly easy for Miss 
Clark to prescribe further reading for Mary, and to stimulate John 
to carry out an individual project in science—for example, classi- 
fying the birds in the local community. The below-average chil- 
dren often present real problems, and as a result they are given 
more of the teacher’s time and effort than the bright children. 
If the extra time which Miss Clark can devote to Harry and Sue 
is insufficient to bring these children up to the sixth-grade levels
in reading and arithmetic, they should be referred to special
classes, if such are available. The larger the number of below- 
average children, the more difficult is Miss Clark’s task, and the 
more likely she is to neglect the bright children. 

It should be noted, as a further point, that the printed norm 
(local or national) does not necessarily establish the optimum 
level of performance for every pupil in the sixth (or any other) 
grade. If Norman, whose IQ is 120, is just on the sixth-grade
norm in reading and arithmetic, he is not performing up to
expectation; his scores should be above the norm for his grade. On
the other hand, if Bill, whose IQ is a modest 94, is at or above 
the norm for the sixth grade in reading and arithmetic, he is 
actually doing better than we can reasonably expect of him. The 
intelligence of the child must always be considered in deciding 
whether his school work is “normal” for the grade. 
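
One conventional way to put a number on “expectation” is to estimate mental age from the ratio IQ (IQ = 100 × MA / CA) and to compare a pupil's educational age with it. The sketch below assumes that approach, which the passage does not spell out. The chronological and educational ages are hypothetical, and only the IQ's of 120 and 94 come from the text.

```python
def mental_age_months(iq, chronological_age_months):
    """Estimate mental age from the ratio IQ: IQ = 100 * MA / CA."""
    return iq * chronological_age_months / 100

def achievement_gap(iq, ca_months, ea_months):
    """Educational age minus estimated mental age, in months.

    Positive: doing better than the IQ would lead one to expect.
    Negative: doing less well than expected.
    """
    return ea_months - mental_age_months(iq, ca_months)

# Hypothetical: both boys are 11 years 6 months old (138 months) and both
# earn an educational age of 11-6, so each is exactly "on the norm".
ca = 138
ea = 138

norman_gap = achievement_gap(120, ca, ea)
bill_gap = achievement_gap(94, ca, ea)
```

Under these assumptions Norman's gap is negative and Bill's is positive, which matches the reasoning above: the same score on the norm means something different for each boy.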

Sometimes Miss Clark will suspect from a pupil’s sullen behavior,
or open aggressiveness, or his tendency to whimper at
the slightest provocation that emotional factors are causing or 
contributing to his difficulties in school. Such a pupil should be 
referred to the school psychologist (if there is one) or to the 
school physician. The clinical psychologist is often able through 
tests and interviews to get a clearer idea of a pupil’s difficulties 
than can the teacher. The teacher should visit a child’s home if 
she suspects that parents and home environment are involved, as 
they often are. Corrective measures (when possible) can be more 
intelligently applied when causal factors making for undesirable 
conduct and/or poor school work are known, rather than sur- 
mised from superficial impressions. 

Prediction. Whether it will be profitable for a student to take 
science or mathematics in high school or college can be forecast 
with considerable assurance from his performance on standard 
tests. Prediction of later success is usually improved when tests 
given in elementary schools are combined with a good intelligence
test. Intelligence and achievement tests are regularly
utilized in many schools in the selection and placement of students
in courses of study. The combination of achievement tests
and special aptitude tests is valuable in predicting a student’s
success in a professional school (in law or medicine, for example).


ACHIEVEMENT TESTS IN SPECIAL 
SUBJECT AREAS 


In the preceding section, we described five general achievement
batteries designed to assess academic standing in school. In
the present section, we shall consider several representative
subject-matter achievement tests. These include tests of reading and
arithmetic, as well as tests planned to determine mental maturity 
(readiness) and proficiency in special subjects. Of the various 
subject-matter tests, those in reading and arithmetic are most 
often given, since they represent fundamental skills upon which 
school achievement largely depends. Subject-matter tests are 
found, of course, in the general achievement batteries, as well as 
in separate form. The tests listed below were selected as being 
typical of a very large number available. 


1. Metropolitan Readiness Tests
2. Iowa Silent Reading Tests
3. Co-operative Mathematical Tests
4. Evaluation and Adjustment Series
5. Co-operative French Test (elementary)
6. Co-operative Science Test


Metropolitan Readiness Tests* 


Description. The primary objective of these tests is to find 
whether a child is sufficiently mature to undertake the study of 
reading. But the tests are concerned also with “readiness” for 
arithmetic, and with general physical and mental maturity. The 
six tests in the battery may be described as follows: 


(1) Word Meaning: child selects picture named by the ex- 
aminer. 


* Published by the World Book Company, Yonkers, N. Y. 




(2) Sentences: same as (1) except that the examiner uses 
sentences and phrases instead of single words. 

(3) Information: child marks the picture corresponding to the 
examiner's oral description. 

(4) Matching: child must recognize similarities and differences 
in pictures, geometrical forms, numbers, letters, words. 

(5) Numbers: child must demonstrate a knowledge of number
concepts and carry out simple operations.

(6) Copying: child is required to copy simple graphic forms,
as well as numbers and letters.


All the test items are pictorial—that is, non-verbal. The test 
has two forms. The battery is essentially a prognostic test: its 
purpose is to forecast a child’s mental, sensory-motor, and mus- 
cular readiness for first-grade work. Figure 5-3 shows sample 
items. 


Scope. The test is for the end of kindergarten and the begin- 
ning of first grade. The test requires about sixty minutes working 
time. 


Scoring. Norms in percentile ranks allow the teacher to estimate
a pupil’s readiness for reading (based on tests 1-4), readiness for
arithmetic (test 5), and general maturity for first-grade work
(tests 1-6). In addition, a child’s score is given a rating from A to
E. An A rating denotes an excellent risk, the other letters a lesser
degree of certainty down to E, which implies almost certain
failure.


Prognostic Value of the Metropolitan Readiness Tests. The test 
battery as a whole forecasts general maturity for the first grade, 
but its sub-tests may be used diagnostically to provide informa- 
tion about individual children. If Ben makes low scores on tests 1, 
2, 3, and perhaps 4, for example, he has inadequate maturity in 
language for first-grade work. Or he has too little experience 
with and comprehension of language generally. If Louise earns 
low scores on tests 4 and 6, she is probably too immature to undertake
written work. As these two tests measure visual perception
and hand-eye co-ordination, an eye examination and training in
motor skills may be indicated. Test 5 (numbers) shows readiness
for number work, and the child who scores high should be able
to use numerical symbols. Test 6 (copying) has proved to be a
good measure of physical and mental maturity. From this test,
the teacher can pick up tendencies to reversals in drawing and
writing, phenomena fairly common at this age level. If a child
has not developed reading readiness by age 7, he should be
examined by a physician, an oculist, and perhaps a psychologist.


FIGURE 5-3 Sample Items from Metropolitan Readiness Tests

Test 1. Word Meaning. In the first row, the child marks the baby;
in the second row, the house.

Test 4. Matching. In each row the child circles the picture identical
to the one in the circular frame.

[Pictures not reproduced.]

Reproduced by permission of the World Book Company.




Iowa Silent Reading Tests*

Description. This test consists of two batteries, one for elemen- 
tary schools and one for high schools and colleges. Both batteries 
measure reading rate, vocabulary, sentence comprehension, 
paragraph reading, and skill in locating information. Speed is an 
element in the battery, as well as power. The Elementary Test 
includes a reading comprehension test called “directed reading,” 
and the Advanced Test a test of poetry comprehension.


Scope. The two batteries cover the following range:

Elementary Test (four forms): grades 4-8
Advanced Test (four forms): high school and college freshmen

Working time for either battery is about 50 minutes.

Scoring. There are six sub-tests in the Elementary Test:

1. Rate and comprehension in reading connected prose.
2. Directed reading of prose to get answers.
3. Vocabulary and word meaning.
4. Paragraph reading: selecting the main idea and adding
appropriate details.
5. Sentence meaning: understanding brief sentences out of
context.
6. Work-study skills: alphabetizing and using an index.


Test 1 yields two scores (rate and comprehension) and Test 6 
two scores (alphabetizing and use of index). Tests 2, 3, 4, and 5 
yield one score each. These 8 sub-scores are converted into scaled 
scores by means of tables appended to each test. Scaled scores 
may be plotted on a profile to show the variations in perform- 
ance. Percentile norms are also provided by grade for each sub- 
test and for total score. There are age and grade equivalents to 
total score. 

The Iowa Test can be expected to spot (a) the extremely slow 


* Published by the World Book Company, Yonkers, N. Y. 




reader, (b) the careless reader who fails to follow directions, 
omits necessary details, and skims over important facts, and (c) 
the rapid but uncomprehending reader. 


Co-operative Mathematics Test for Grades 7, 8, and 9* 


Description. This test consists of four parts: I, skills; II, facts, 
terms, and concepts; III, applications; IV, appreciation. Ques-
tions and problems cover basic arithmetic as well as simple algebra
and geometry. The test may be used for survey purposes, but it
is perhaps more valuable in evaluation and guidance. Sample items
from the test are shown in Figure 5-4. 


Evaluation and Adjustment Series (High School)** 


Description. This is an extensive battery of subject-matter and 
other tests (twenty-four so far and more to be added) designed 
for use in high schools. The tests cover such traditional areas 
as algebra, biology, geometry, physics, history, and literature. In 
addition, there are tests of reading comprehension, “problems in
democracy,” health knowledge, and study skills. The content of 
the tests has been drawn from standard textbooks, courses of 
study, and professional literature. Tests may be administered as 
separates or as parts of a general survey. 


Scope. For survey and diagnosis in grades 9 through 12. There 
are two forms for most tests. 


Scoring. Raw scores are converted into scaled scores for each 
test, so that comparisons may be made from test to test. Results 
may also be compared graphically by means of a profile. Many 
of the tests provide charts showing what score is to be expected 
at given IQ levels. IQ’s are from the Terman-McNemar Test of 
Mental Ability. The reliability of the various tests in the battery 
is satisfactory. The separate tests require from 45 minutes to an 
hour of working time. 


* Published by the Educational Testing Service, Princeton, N. J. 
** Published by the World Book Company, Yonkers, N. Y.




FIGURE 5-4 Sample Multiple-Choice Items from the Coopera- 
tive Mathematics Test for Grades 7, 8, and 9 


From Part I, Skills:
39. 16 equals
[answer choices not recoverable]


From Part II, Facts, Terms, and Concepts:

7. Which of the following is a unit in the
metric system?
7-1 Ounce
7-2 Centimeter
7-3 Yard
7-4 Bushel
7-5 Gross


From Part III, Applications:

24. If a man spends 12% of his salary on bonds,
and buys a $37.50 bond each month, what
is his monthly salary?
24-1 $312.50
24-2 $312.60
24-3 $350
24-4 $376.20
24-5 $450


From Part IV, Appreciation:

20. Which of the following has no volume?
20-1 Cylinder
20-2 Cone
20-3 Square
20-4 Cup
20-5 Rectangular box


Reproduced by permission of the Educational Testing Service. 


Co-operative French Test (Elementary)*


Description. The specifications for this test call for knowledge
of French grammar and vocabulary, plus the ability to use the


* Published by the Educational Testing Service, Princeton, N. J. 




language in reading and translation. The test has three parts: 
vocabulary, grammar, and reading. The vocabulary section is a
multiple-choice test of fifty words. Grammar (thirty-five items) 
requires the selection of one of five choices to complete correctly 
the translation of an English sentence into French. In the reading 
section, forty incomplete sentences in French are to be completed 
from a list of five options. The reliability of the test is high.


FIGURE 5-5 Sample Multiple-Choice Items from the Coopera-
tive Science Test for Grades 7, 8, and 9


From Part I, Informational Background:

3. It is believed that dinosaurs lost out in
their struggle for existence chiefly because
3-1 they were killed by man for food.
3-2 man could not tame them.
3-3 they were not adapted to changes
that took place in the earth's sur-
face and climate.
3-4 they were not fitted to eat plant food.
3-5 they had no brains.


From Part II, Terms and Concepts:

2. The instrument used to look at and study
the surface of the moon and the planets is
the
2-1 galvanoscope.
2-2 microscope.
2-3 telescope.
2-4 electroscope.
2-5 radiometer.

13. If two plants of the same species but of
different varieties are mated, the offspring
are called
13-1 mongrels.
13-2 sports.
13-3 biennials.
13-4 lentils.
13-5 hybrids.


Part III, Comprehension and Interpretation:

This test consists of multiple-choice items to be answered after reading
a paragraph of scientific prose or examining a table. The selection must
be understood and interpreted.


Reproduced by permission of the Educational Testing Service. 




Scope. This test is intended for the first two years of high 
school or for the first year of college study of French. 


Scoring. Scaled scores are provided for each of the three parts 
of the test and for the total. There are percentile norms for high- 
school and for college classes. Working time for the test is forty 
minutes. 


Co-operative Science Test (Grades 7, 8, and 9)* 


Description. There are three parts to this test: Part I, informa-
tion and background; Part II, terms and concepts; Part III, com-
prehension and interpretation. The test is planned to measure 
knowledge and application. Part I is in multiple-choice form. 
Part III consists of readings in science, each reading followed by 
questions designed to assess the student’s understanding, as well 
as his ability to interpret and apply what he has read. (Figure 5-5) 


Scope. Grade 9 and superior seventh and eighth graders. 


Scoring. There are scaled scores for the three parts and for the 
total. Percentile norms are given for grades 7, 8, and 9. The 
working time for the whole test is about eighty minutes. Re- 
liability of the whole test is high. 


WHAT TO LOOK FOR IN AN EDUCATIONAL 
ACHIEVEMENT TEST 


The suitability of an educational achievement test for a given 
situation must be determined from an examination of its validity, 
its reliability, its scaling techniques, and its norms. The cost, 
time, and personnel needed to administer and score the tests must 
also be considered. These same requirements apply to group tests 
of intelligence. Each of the main characteristics of a mental test, 
except perhaps validity, has been commented on at appropriate 
places throughout this chapter. A summary of the relevant data 
under each category will now be offered. 


* Published by the Educational Testing Service, Princeton, N. J. 




Validity. An educational achievement test is valid when it 
measures what it undertakes to measure. Most subject-matter 
tests possess content validity. An arithmetic test or a geography 
test or a reading test, for example, is valid by definition when it 
contains a sampling of arithmetic problems, geography questions, 
and paragraphs to be read. The standardized educational test is
made up of items taken from a variety of sources: widely used
textbooks, courses of study, examination questions, and outlines.
The items in tentative form are checked by experienced teachers
and are put into objective form by test construction specialists. A
broad selection of items insures a comprehensive sampling of 
materials. 

One validation technique employed in some educational tests 
is the following. The test is provisionally drawn up and is admin- 
istered to an experimental group; only those items are retained 
which show an increasing percentage passing with age or with 
grade. Other techniques of item analysis will be described in
Chapter 9. All of these procedures are directed toward selecting 
questions which will work together as a team, cover a wide range 
of difficulty, and be related closely in content (be homogeneous). 
The standard test, when finally made, is a compact and closely 
knit instrument for measuring what it purports to measure. Data 
on validation procedures will be found in most Manuals which 
accompany standardized achievement tests. 
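The item-retention rule described above (keep only items whose percentage passing rises with grade) can be sketched in a few lines. This is a minimal illustration, not the procedure of any published test: the item names, the grades, and the percentages are hypothetical, and "increasing" is assumed here to mean a strict rise from each grade to the next.

```python
def passes_increase_with_grade(percent_passing):
    """Return True if the percentage passing rises from each grade to the next."""
    return all(lower < higher
               for lower, higher in zip(percent_passing, percent_passing[1:]))


def retain_items(tryout):
    """Keep the items whose percentage passing increases grade by grade."""
    return [item for item, percents in tryout.items()
            if passes_increase_with_grade(percents)]


# Hypothetical try-out results: percent passing in grades 4, 5, and 6.
tryout = {
    "item_1": [35, 52, 71],   # rises steadily: retained
    "item_2": [60, 58, 63],   # dips in grade 5: dropped
    "item_3": [20, 41, 66],   # rises steadily: retained
}

kept = retain_items(tryout)
print(kept)
```

An item such as item_2, which older pupils pass no more often than younger ones, tells the test maker little about growth in the subject and would be discarded under this rule.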


Reliability. The reliability of the educational achievement tests 
described in this chapter has been generally reported as high. 
This means that parallel forms of the test correlate highly (over
.90 in most cases) so that we may have confidence in the stability 
of a child’s score. In most test Manuals, reliability is expressed by 
the “reliability coefficient,” also called the self-correlation of the 
test, or by the standard error of an obtained score. The correla- 
tion of a test with itself (by retest) or between alternate forms of 
the same test tells us how closely the pupils’ scores “stay put.”
The standard error of a score tells us how much fluctuation to




expect in a child’s score upon retest. If the standard error is three 
points, for example, the odds are two to one that Bill’s score of 64 
will, on a second trial on the test, vary up or down from the first 
determination by not more than three points. The smaller the SE 
of a test score, the greater the stability of the obtained score. The 
SE of a test score gives us more information concerning reliability 
than does the reliability coefficient alone (page 29). 
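The two reliability measures just described can be illustrated with a short sketch. It computes a reliability coefficient as the correlation between two forms of a test, and then the band of one standard error around Bill's score of 64. The form scores are hypothetical, and the "two to one" odds are reproduced on the assumption that errors of measurement are normally distributed.

```python
from math import sqrt
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Product-moment correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def score_band(score, standard_error):
    """Range within which a retest score is expected to fall (one SE each way)."""
    return (score - standard_error, score + standard_error)


# Hypothetical scores of six pupils on Form A and Form B of the same test.
form_a = [52, 64, 47, 71, 58, 66]
form_b = [54, 62, 49, 70, 55, 68]
reliability = pearson_r(form_a, form_b)

# Bill's obtained score of 64 with a standard error of three points.
band = score_band(64, 3)

# Chance of staying within one SE, assuming normally distributed errors.
p_within = NormalDist().cdf(1) - NormalDist().cdf(-1)
odds = p_within / (1 - p_within)

print(round(reliability, 2), band, round(odds, 1))
```

Under the normal assumption about two retest scores in three fall within one standard error, which is the sense of "odds of two to one."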

Scores obtained on most standard tests are highly stable, but 
part scores based on a relatively few items are variable and may 
be quite unreliable. Conclusions as to strengths and weaknesses
based on unstable scores are always tentative, and must be re- 
garded as suggestive only. 


Scaling. Most educational achievement tests are first scored in 
arbitrarily assigned points, so many points being given for a 
correct answer. These point scores are usually converted into 
scaled scores by means of tables printed at the end of the sub- 
test. The meaning of standard scores and of T-scores has been 
discussed in Chapter 2. Raw or obtained scores (point scores) 
from the sub-tests of a battery differ in length, difficulty, and 
content; they cannot be compared or combined as they stand. 
When scaled, scores expressed in different units are comparable. 
Scaled scores—and sometimes raw scores—are usually converted 
into age and/or grade equivalents—into the age and grade values 
which correspond on an average to the given scores. If the 
average child of 9 years and 4 months earns a score of 38 on an 
Arithmetic Fundamentals Test, then the score of 38 “equals” an 
educational age (EA) of 9-4. If children who are half way 
through the seventh grade (that is, at 7.5) earn a mean score of 
63 on a Reading Test, the score of 63 has a grade equivalent of 
7.5.
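The need for scaling can be seen in a small sketch. Here raw scores from two sub-tests of different length and difficulty are put on a common scale with a mean of 50 and a standard deviation of 10. This linear conversion is an assumption made for illustration; published tests supply their own conversion tables, and the raw scores below are hypothetical.

```python
from statistics import mean, pstdev


def t_scores(raw_scores):
    """Convert raw scores to a scale with mean 50 and standard deviation 10."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)
    return [50 + 10 * (x - m) / sd for x in raw_scores]


# Hypothetical raw scores of the same five pupils on two sub-tests.
arithmetic_raw = [10, 20, 30, 40, 50]   # long sub-test, wide spread
reading_raw = [61, 63, 65, 67, 69]      # short sub-test, narrow spread

arithmetic_t = t_scores(arithmetic_raw)
reading_t = t_scores(reading_raw)

for a, r in zip(arithmetic_t, reading_t):
    print(round(a, 1), round(r, 1))
```

The last pupil's raw scores of 50 and 69 cannot be compared as they stand, but on the common scale both come to about 64, showing the same relative standing on each sub-test.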

The educational age (EA) may be divided by the chronological
age (CA) to give an educational quotient (EQ): EQ = EA/CA.


This EQ is a measure of acceleration and is somewhat analogous 




to the IQ. The EA and EQ are often useful, provided they are 
taken to refer only to the tests on which they are based and are 
not thought of as general indices. 
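A short worked example may make the quotient concrete. The sketch below takes ages written in the years-months form used above (9-4 for 9 years and 4 months), converts them to months, and divides EA by CA. The chronological age of 8-0 is hypothetical.

```python
def to_months(age):
    """Convert an age written 'years-months' (for example '9-4') to months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)


def educational_quotient(educational_age, chronological_age):
    """EQ as the ratio of educational age to chronological age."""
    return to_months(educational_age) / to_months(chronological_age)


# A pupil with an EA of 9-4 and a (hypothetical) CA of 8-0.
eq = educational_quotient("9-4", "8-0")
print(round(eq, 2))
```

A ratio above 1 indicates that the pupil is ahead of the average child of his age on this particular test. Quotients of this kind are often reported multiplied by 100, as the IQ is; whether a given Manual does so should be checked there.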


Norms. Norms are typical measures of performance. In a 
standard educational test, the mean score made by a large and 
representative group of fifth-grade pupils is the norm for fifth- 
grade children on this test. Norms are expressed in age and grade 
equivalents, as percentile ranks, and in the form of scaled scores. 
A child’s grade placement is found by computing the tenths of 
the school year which have passed before the test was given. If 
the school year begins about September 1 and ends June 15, a 
sixth-grade class tested in the period between March 16 and April 
15 is assigned the grade position of 6.7—the class is 7/10 into the 
school year. Most standard educational achievement tests report 
nation-wide norms in their Manuals. These typical performances 
are based on the achievements of large groups of children from 
all over the country. As we have pointed out, local norms (for 
city or state or both) are often better measures of pupil achieve- 
ment. Any pupil’s scores relative to those of other pupils should 
be evaluated in terms of his effort, his intelligence, and his home 
and community. 
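The grade-placement rule in this paragraph can be sketched as a small function. The text gives only one period explicitly (March 16 to April 15 counts as .7); the sketch assumes that every other tenth likewise runs from the 16th of one month to the 15th of the next, starting on September 16, and that September 1 to 15 counts as .0. These boundaries are an extrapolation, not a rule quoted from any Manual.

```python
from datetime import date


def grade_placement(grade, test_date):
    """Grade position in tenths of a school year running Sept. 1 to June 15."""
    if test_date.month == 9 and test_date.day < 16:
        tenths = 0  # assumed: the opening two weeks count as .0
    else:
        # Assumed: each period runs from the 16th of a month to the 15th of the next.
        period_month = test_date.month if test_date.day >= 16 else test_date.month - 1
        tenths = (period_month - 9) % 12 + 1
    if tenths > 9:
        raise ValueError("date falls outside the September-June school year")
    return round(grade + tenths / 10, 1)


# A sixth-grade class tested between March 16 and April 15.
print(grade_placement(6, date(1959, 3, 20)))
```

A class tested on April 10 falls in the same period and receives the same placement, while one tested on April 16 would move on to the next tenth.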


Other Factors in the Selection of a Test. The cost of a testing pro- 
gram, the personnel required, and the time it will take from other 
school activities—all these must be considered in adopting a given 
test or tests. Tests which fit easily into a class period, which can 
be scored objectively (by means of stencils) by a clerk, and 
which are acceptable in form and content to teachers and to 
parents are in general least disruptive of the school’s routine. 


SUGGESTIONS FOR FURTHER READING 


Anastasi, A. Psychological Testing. New York: Macmillan, 1954. 

Greene, H. A., Jorgensen, A. N., and Gerberich, J. R. Measurement 
and Evaluation in the Elementary School (2nd edition). New York: 
Longmans, Green, 1953. 




Jordan, A. M. Measurement in Education. New York: McGraw-Hill,
1953.

Traxler, A. E. et al. Introduction to Testing and the Use of Test Results
in Public Schools. New York: Harper, 1953.


SUGGESTIONS FOR LABORATORY WORK 


1. Administer two or three standardized achievement tests to the 
class, cutting the time to one-half if necessary. Have students score 
their own tests and plot profiles where called for. 

2. Analyze a standard reading test, listing the objectives which you 
think the author had in mind. Do you agree that these objectives were
fulfilled?

3. Select a test taken in (1). Consult the Manual for data on validity,
reliability, scaling procedures, and norms.


QUESTIONS FOR DISCUSSION 


1. For which of the following purposes would a standardized achieve-
ment test be useful:

(1) To discover which pupils have not mastered multiplication and 
division of fractions. 

(2) To determine which pupils are reading too slowly. 

(3) To determine for the class which punctuation skills need further 
work. 

(4) To section the class into two groups for teaching arithmetic. 

(5) To discover the subjects in which each pupil is strong and in
which weak.

2. A teacher lists the following as objectives of a course in history and
civics:
(1) To present facts in the field.
(2) To prepare the class for the duties of citizenship.
(3) To further appreciation of democracy.
(4) To foster criticism of governmental processes.
(5) To aid pupils in thinking about problems in government.
Which of these objectives is the teacher most likely to fulfill?

3. The Manual of Test ABC states that the test may be used for
diagnostic purposes. What do you look for in a test to determine whether
it has diagnostic value?

4. A professor of English states that batteries of standard tests tell the
English teacher nothing that cannot be better found out from a theme
and an interview. Do you agree?




5. Why is it necessary that sub-tests in a battery to be used for diag- 
nosis have high reliability? 

6. In some schools, teachers prepare for a testing program by having 
students review older standard examinations. What effect could this 
have on the students’ morale? On the comparability of test results from 
school to school? Is it good educational practice? 

7. In School A, the pupils in grades 4 to 7 are given the California 
Achievement Tests. Scores are recorded in grade equivalents only. What 
other types of scores would be valuable? Why?

8. The Manual of a reading test reports a correlation of .40 with 
English marks in the first year of high school. Is this good evidence of 
validity? Discuss.

9. Suppose that the Metropolitan Achievement Tests have been admin- 
istered in grade 5 in October. How might you, as the teacher, use the 
results of the test? 

10. For what predictive purposes would it be desirable to have the 
results from the following tests: 

(1) A test of ability to read difficult scientific prose drawn from 

various fields. 

(2) A test of skill in grammar: punctuation, capitalization, sentence 

structure, and so on. 

11. How could you use the results from a group intelligence test to 
supplement scores made by your pupils on an achievement battery? 

12. Is it important to have tests of speed, as well as of power? 


CHAPTER 6 


APTITUDE TESTS 


When a youngster possesses traits and abilities which enable
him to speak French readily, acquire mathematics, deal handily
with tools, or play a musical instrument well, he is said to have
aptitude for the given activity. Aptitudes are probably inherited
basically, but they cannot appear unless the environment is
favorable—that is, unless the opportunity is provided. Very often
some training, often a great deal of it, is necessary, too, before an
aptitude reveals itself in performance.

Aptitude tests are not essentially different in form or in con-
tent from intelligence and educational achievement tests, since
all mental tests are in reality measures of aptitude. Intelligence


tests measure capacity for school work and for vocations requir- 
ing school training; and achievement tests measure proficiency 
in English grammar, mathematics, science, and other subjects. 
Perhaps the chief difference between these tests and those de- 
signed to measure aptitudes is the fact that an aptitude test is 
concerned almost entirely with the future—with prognosis. Thus 
an engineering aptitude test is used typically to forecast an ex- 
aminee’s chances of success in engineering. The aptitude test 
alone is, of course, rarely able to provide a wholly satisfactory 
estimate of probable performance later on. For an individual’s
efforts to be maximally effective, aptitude must be supple- 
mented by training. Furthermore, the examinee must possess 
initiative, interest in the job, and favorable personality charac- 
teristics. 

We have classified aptitude tests under four heads: (1) general, 
(2) special, (3) professional, and (4) talent. The two best-known 
general aptitude batteries are those designed to assess aptitude 
for (a) mechanical tasks, and (b) for clerical work. Many special 
tests (of speed, co-ordination and reaction time) have been de- 
vised to measure aptitudes believed to be crucial in industry. 
Achievement tests, too, are employed as aptitude tests to reveal 
an examinee’s performance in languages or mathematics, for 
example, and hence provide a measure of his promise in more ad- 
vanced courses. In the field of professional work, aptitude test 
batteries have been assembled to assess the traits believed nec- 
essary for success in medical school, in law, in engineering 
and in teaching. Aptitude in music and art is generally called 
talent, and tests are available to forecast achievement in these
fields.


GENERAL APTITUDE BATTERIES


The general aptitude battery attempts to forecast probable 
success in a number of related tasks or vocations by sampling a
wide range of behaviors believed to be involved in the activity.
In this section, two batteries designed to measure aptitude for 




mechanical work are described, together with two batteries 
planned to measure aptitude for clerical proficiency. 


Mechanical Aptitude 


The term “mechanical aptitude” includes a variety of behav-
iors. One of the earliest mechanical aptitude tests consisted of
a box containing a number of common gadgets in separate com-
partments. Each of these contrivances (a lock, door bell, clothes
pin, and so on) was to be assembled with the aid of simple tools.
The score was determined by the speed and accuracy of assem-
bly. This kind of test is often described as a “job sample” or
“vocational miniature,” since it involves what has to be done on 
a small scale. Among the sub-tests in paper-and-pencil batteries 
devised to measure mechanical aptitude are (1) tests requiring 
motor speed and dexterity of movement, (2) tests of the ability 
to visualize or perceive mechanical and spatial relations (im- 
portant in reading blueprints and in architectural drawings); (3) 
tests of mechanical information concerning tools, machines, and 
the construction and use of various contrivances; and (4) tests 
of mechanical reasoning as demonstrated in the ability to solve 
problems dealing with tools, pulleys, levers, machine parts, and 
the like. In addition, in assessing mechanical aptitude, inventories 
are used which are designed to reveal interest in mechanical 
things. Such interest may be shown, for example, when a boy 
reads Popular Science avidly, has his own tools, tinkers with 
radios, and builds space machines. One of the most useful find- 
ings to come out of the testing program in World War II was 
the discovery that paper-and-pencil tests of mechanical aptitude 
are as predictive of success in many mechanical jobs as are actual 
job samples covering the work. 

The following two test batteries are representative of the best
tests in this field:


MacQuarrie Test of Mechanical Ability 
Bennett Mechanical Comprehension Test 




MacQuarrie Test of Mechanical Ability* 


Description. This battery consists of seven paper-and-pencil 
tests, as follows: 


1. Tracing: following a narrow path.

2. Tapping: making dots rapidly.

3. Dotting: placing dots precisely.

4. Copying: making a figure from co-ordinates.

5. Location: locating items by co-ordinates.

6. Block Counting: counting hidden blocks in a stack.

7. Pursuit: tracing a line through a tangled pattern.


Sample items from the MacQuarrie tests are shown in Figure 6-1. 

All these tests are relatively simple and all are speeded: testing 
times are short. The MacQuarrie tests are designed to measure 
hand-eye co-ordination, finger movement and speed, manual dex- 
terity, visual acuity, and spatial perception of direction and size. 
Taken as a whole, the MacQuarrie battery measures motor dex-
terity at a fairly low level of difficulty rather than aptitude for
engineering or for architecture. For the latter, the Bennett Test 
of mechanical comprehension is recommended. Some of the 
MacQuarrie sub-tests are predictive of special tasks: the tests in 
tracing, dotting and pursuit, for example, measure aptitude for 
typing; and tests of block counting, tracing, pursuit, location, and 
copying are related to performance in mechanical drawing and 
the reading of blueprints. The Manual which accompanies the 
MacQuarrie advises the use of sub-test patterns for predicting 
success in various jobs. 


Scope. The MacQuarrie test can be administered from grade 7 
on. It has been employed chiefly in the prediction of success in
factory and other manual-manipulative work. 


Scoring. Percentile norms are available for the sub-tests and for 
total score. The working time for the whole test is about twenty 
minutes. Since some of the tests in the battery are allotted only 


* Published by the California Test Bureau, Los Angeles, Calif. 




FIGURE 6-1 Sample Items from the MacQuarrie Test of
Mechanical Ability


Copying: Copy figure by joining dots.

Blocks: How many blocks touch each block with an X on it?

Pursuit: Follow each line by eye and show where it ends, by
writing its number in the correct box at the right.


Reproduced by permission of the California Test Bureau.


ten to twenty seconds, a stop watch is needed in order to time
the tests accurately. The reliability of the whole test is high.
Reliabilities of the seven sub-tests are lower, but are fairly satis-
factory for such short tests.


Bennett Mechanical Comprehension Test* 


Description. This is a paper-and-pencil test in which compre-
hension of mechanical relations is determined by means of pic-


* Published by The Psychological Corporation, New York, N. Y. 




tures and sketches. The test is fairly advanced in difficulty. Each 
picture or drawing has a simply phrased question designed to 
reveal the examinee’s understanding of the mechanical problem 
presented. Figure 6-2 shows samples from the test battery. 


Scope. There are four forms of the Bennett test. Form AA, the 
easiest, is suitable for trade and high schools and for less well 
trained workers. Form BB, more difficult, is for engineering


FIGURE 6-2 Samples from Bennett Mechanical Comprehension 
Test 


Which room has more of an echo? 


Which would be better shears 
for cutting metal? 




Which gear turns slower? 


Which cart is more likely to tip 
over on the hillside? 


Reproduced by permission of The Psychological Corporation. 




school applicants, technicians, and engineers. Form CC, the most 
difficult, differentiates among examinees of high ability levels. 
The fourth form, WI, is for women. 


Scoring. Percentile norms, which are supplied for each test 
form, are applicable to a variety of student and occupational 
groups. The test is valuable in guidance, in selecting applicants 
with aptitude for mechanical thinking, and in the selection of
students wanting to study mechanics and engineering. The Mac-
Quarrie is a useful supplement to the Bennett test when speed
and manual dexterity are required as well as more abstract think-
ing about mechanical relations.

The reliability of the Bennett is satisfactory. Validity is hard 
to determine, but the test is valid in relation to such criteria as 
grades in high-school shop courses and occupational and in-
dustrial performance.


Clerical Aptitude 


Tests planned to gauge clerical aptitude are concerned mainly
with perceptual speed and accuracy in reading, writing, and
marking, and with manual dexterity and skill. Office workers
are designated in several ways, such as general clerk, sales clerk,
shipping clerk, filing clerk, typist, and receptionist. The jobs
differ in the kind and variety of their duties, but all demand (to
a greater or lesser extent) reading, writing, sorting, checking,
filing, folding, sealing, and stamping.

The present section will describe two tests of clerical apti-
tude, the first fairly narrow in functions covered, the second
much broader.


Minnesota Clerical Test
General Clerical Test


Minnesota Clerical Test*

* Published by The Psychological Corporation, New York, N. Y.


Description. This battery covers speed and accuracy in per-
ceiving clerical detail. There are two parts, number comparison
and name comparison. In the first, the examinee is shown two 
hundred pairs of numbers each containing from 3 to 12 digits. 
If the two numbers are alike, the examinee places a check (√)
between them; if they are unlike, he leaves the space blank. In the 
second test, proper names (which match or fail to match) are 
substituted for number pairs. Samples are shown below: 


79542 ___ 79524
5794367 _√_ 5794367
John C. Linder ___ John C. Lender
Investors’ Syndicate _√_ Investors’ Syndicate


The Minnesota Clerical Test is not designed to encompass all 
the factors which make for proficiency in office work, but it 
does attempt to predict ability to handle addresses, bills, accounts, 
and so on. The Minnesota test has been found to have prognostic 
value in the selection of clerks, packers, checkers, inspectors (of 
products), and other factory jobs. 
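The checking task itself is easy to sketch. The snippet below builds an answer key for the four sample pairs shown above (a pair is checked only when the two entries are identical) and counts an examinee's right and wrong marks. The examinee's marks are hypothetical, and the counting shown is an illustration only; it is not the scoring rule of the published test, which should be taken from the Manual.

```python
def answer_key(pairs):
    """A pair should be checked only when both entries are exactly alike."""
    return [left == right for left, right in pairs]


def count_right_and_wrong(marks, key):
    """Count the items on which the examinee's marks agree or disagree with the key."""
    right = sum(mark == correct for mark, correct in zip(marks, key))
    return right, len(key) - right


# The four sample pairs from the text.
pairs = [
    ("79542", "79524"),
    ("5794367", "5794367"),
    ("John C. Linder", "John C. Lender"),
    ("Investors' Syndicate", "Investors' Syndicate"),
]
key = answer_key(pairs)

# Hypothetical examinee: checks the last three pairs, missing the Linder/Lender difference.
marks = [False, True, True, True]
right, wrong = count_right_and_wrong(marks, key)
print(key, right, wrong)
```

The one error here, overlooking a single changed letter in a name, is exactly the kind of slip in perceiving clerical detail that the test is built to detect.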


Scope. This clerical test may be used with students from junior 
high school on and for adults. 


Scoring. The working time of the test is about fifteen minutes, 
so that both speed and accuracy enter into a score. Individual 
differences appear in the scores and must be taken into account 
in interpreting the test. A very careful examinee, for example, 
may make few errors but earn a relatively low score because of
slowness and over-cautiousness. On the other hand, a fast but 
careless worker may mark more items but tend to make many 
errors. Percentile norms are available for boys and girls, junior 
and senior high-school students, and several groups of industrial 
workers. Among the latter there are norms for women who are 
machine operators, typists and clerks; for men who are tellers 
(bank), accountants and various sorts of clerks. A high score 
earned by a student does not necessarily mean that this examinee 




will make a good clerical worker, though it is a decidedly good 
omen. On the other hand, a high-school counselor would cer- 
tainly be wise to question the vocational promise of a com- 
mercial and business student who scored below the twenty-fifth 
percentile of clerical workers. The reliability of the test is high. 
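The counselor's check against percentile norms can be sketched as follows. The norm-group scores are hypothetical, and the percentile rank is computed by one common convention (the percentage of the norm group scoring below the pupil, plus half of those scoring the same); a published Manual supplies its own norm tables, which should be used in practice.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group below the score, counting ties as half."""
    below = sum(s < score for s in norm_scores)
    equal = sum(s == score for s in norm_scores)
    return 100 * (below + 0.5 * equal) / len(norm_scores)


# Hypothetical scores of ten employed clerical workers (the norm group).
clerical_workers = [88, 95, 102, 110, 117, 123, 131, 140, 148, 155]

# A commercial student's score compared with that group.
student_rank = percentile_rank(100, clerical_workers)
needs_review = student_rank < 25

print(student_rank, needs_review)
```

A rank of 20 places this student below the twenty-fifth percentile of clerical workers, the point at which the text suggests a counselor question the student's vocational promise.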


General Clerical Test (GCT)*


Description. This test battery is designed to measure three kinds 
of aptitude judged to be valuable in office work. There are nine 
sub-tests in the battery. Parts I and II test clerical speed and
accuracy; Parts III, IV, and V numerical ability; Parts VI, VII, 
VIII and IX verbal facility. The first two (checking and alpha- 
betizing) measure perceptual speed and accuracy as expressed in 
such activities as sorting, coding, and alphabetizing. The next 
three measure numerical aptitude as shown in computation, error 
location and arithmetical reasoning. The last four measure verbal 
facility by means of spelling, reading comprehension, vocabu-
lary, and grammar. The over-all score is a good measure of
abstract intelligence, as well as of aptitude for clerical work. 
The test is to be recommended, therefore, for clerical jobs which 
demand a relatively high level of intelligence.


Scope. The battery is intended for use with high-school and
business school students. The GCT may also be valuable when
testing applicants for more responsible clerical positions. The
working time for the test is about fifty minutes.


Scoring. Percentile norms are available for high schools and for 
business schools, as well as for various sorts of clerical workers. 
Norms for each sub-test, as well as for total score, are provided. 
The reliability of the whole test is high—greater than .90. The 
reliability of the sub-tests is much lower, and the counselor must 
be tentative in judgments based upon parts of the test. 


* Published by The Psychological Corporation, New York, N.Y. 




APTITUDE TESTS IN SPECIAL AREAS* 


In this section five test batteries often useful to the educational 
counselor and classroom teacher will be described. These spe- 
cialized examinations are illustrative of many tests in this field: 


Differential Aptitude Tests 

Minnesota Paper Form Board 

Murphy-Durrell Diagnostic Reading Readiness Test 
Orleans Algebra Prognosis Test 

Turse Short-Hand Aptitude Test 


Differential Aptitude Test (DAT)** 


Description. This battery is designed for educational and voca- 
tional guidance of high-school students. There are seven sub- 
tests, each of which yields a separate score: 


Verbal reasoning: A difficult verbal analogies test, which measures ability
to handle verbal relations. Aspirants for professions should earn high
scores.

Numerical ability: An arithmetic test covering a wide range of opera-
tions. This test is an important predictor in science and engineering.

Abstract reasoning: A non-language test which demands the solution of
problems expressed in diagrams and figures. The test measures a high
level of abstract intelligence.

Space relations: Ability to perceive a three-dimensional object from a
two-dimensional pattern. Useful in engineering, architecture, and
drafting.

Clerical speed and accuracy: A test of speed and accuracy in the per-
formance of clerical tasks. Speed is an important factor.

Mechanical reasoning: A form of the Bennett Mechanical Comprehension


* Under aptitude tests are often listed sensory-motor tests of visual and
auditory keenness, as well as special tests of motor skills and dexterity.
Apparatus tests of this sort are valuable in industry and the military,
but they are not used routinely in the schools and will not be described here.
Some of the devices are very complex and require specialized training on the
part of the examiner. Oral Trade Tests constitute another sort of special
aptitude test which will not be treated here. These tests are really oral
interviews, are administered individually, and are valuable in appraising the
vocational training and work experience of an applicant.

** Published by The Psychological Corporation, New York, N. Y. 




Test. Useful as a predictor of engineering aptitude when combined 
with the first four tests above. 

Language usage: Two tests scored separately which measure the ability 
to spell and to locate errors in sentences. Emphasizes the mechanics 
of language as compared with test #1, which emphasizes abstract 
comprehension. 


FIGURE 6-3 Sample Items from the Differential Aptitude Tests 


MECHANICAL REASONING
Which man in this picture has the heavier load?


In each test item, one of the five combinations is underlined.
Find the same combination on the answer sheet and mark it.


LANGUAGE USAGE: I Spelling
Indicate whether each word is spelled right or wrong.

EXAMPLES

W. man

X. gurl


LANGUAGE USAGE: II Sentences
Decide which of the lettered parts of each sentence contains errors,
if any, and mark the corresponding letters on the answer sheet.
Ain't we / going to the / office / next week / at all.
    A          B            C          D          E


Reproduced by permission of The Psychological Corporation. 




The illustrative items in Figure 6-3 show the nature of the
sub-tests.

A main feature of the DAT is that the total score is broken
down into several components, so that from a student's profile
we have a record of comparative performance in eight fundamental
activities. The Manual gives explicit instructions for administering
and scoring the test battery. In addition, a Casebook
illustrates the use of the profile in diagnosis, and will be helpful
to guidance counselors.


Scope. For grade 8 and for high-school grades 9-12.


Scoring. Percentile norms are supplied for grades 8 through 12 
for total score and for scores on each sub-test. Since there are 
large sex differences, percentile norms are given for boys and 
girls separately. Scaled scores (with a mean set at 50) are em- 
ployed in plotting the profiles. Figure 6-4 shows the profile of 
a boy who could profit from educational counseling. Note that 
James is high in the space and mechanical tests, but mediocre to 
low in all the others. The boy is certainly not “verbally minded,” 
although he appears to have real talent in mechanics. The teacher 
will understand James better if he has his profile available. 

The DAT represents the modern practice of substituting a
number of analytic scores (for example, on a profile) for a single
over-all score. We have noted (page 113) that diagnosis of
strong and weak points from short sub-tests is always precarious
because of their low reliability. The reliability of the total DAT
is very high, and the authors have increased the value of a diagnosis
from the sub-tests by computing the minimal difference
between sub-test scores which will be significant, that is, non-chance.
This makes it possible to say, for instance, that Roy's
score in abstract reasoning is significantly higher than his score in
clerical speed and accuracy, or that Betty's scores in verbal
reasoning and numerical ability do not differ significantly.
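
The idea of a minimal significant difference between two sub-test scores can be illustrated with a short calculation. The sketch below is not taken from the DAT Manual. It assumes that both scores are expressed on the same scaled-score system, that the scale has a standard deviation of 10 (the text states only that the mean is set at 50), and that the two reliability coefficients are the hypothetical values shown. It uses the common formula for the standard error of a difference between two obtained scores, SD × sqrt(2 − r1 − r2), and a 5 per cent level of confidence.

```python
import math


def se_difference(sd, rel_a, rel_b):
    """Standard error of the difference between two scores on the same scale."""
    return sd * math.sqrt(2.0 - rel_a - rel_b)


def minimal_significant_difference(sd, rel_a, rel_b, z=1.96):
    """Smallest score difference treated as non-chance at the chosen z level."""
    return z * se_difference(sd, rel_a, rel_b)


def differs_significantly(score_a, score_b, sd, rel_a, rel_b, z=1.96):
    """True when the two scores differ by at least the minimal difference."""
    gap = abs(score_a - score_b)
    return gap >= minimal_significant_difference(sd, rel_a, rel_b, z)


# All values below are assumptions for illustration, not published figures.
SD = 10.0            # assumed standard deviation of the scaled scores
REL_ABSTRACT = 0.90  # hypothetical reliability, abstract reasoning
REL_CLERICAL = 0.85  # hypothetical reliability, clerical speed and accuracy

needed = minimal_significant_difference(SD, REL_ABSTRACT, REL_CLERICAL)
print(f"Minimal significant difference: {needed:.1f} scaled-score points")
print("62 vs 48 significant?", differs_significantly(62, 48, SD, REL_ABSTRACT, REL_CLERICAL))
print("55 vs 50 significant?", differs_significantly(55, 50, SD, REL_ABSTRACT, REL_CLERICAL))
```

Under these assumed values a gap of about ten scaled-score points is needed before a counselor could call one sub-test score reliably higher than another; lower sub-test reliabilities would widen the required gap.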

FIGURE 6-4 Profile of a High-School Boy on the Differential
Aptitude Tests

(Individual Report Form for the Differential Aptitude Tests, by G. K. Bennett,
H. G. Seashore, and A. G. Wesman; The Psychological Corporation, New York 18, N. Y.
The profile shown is that of James, tested at Midtown High School.
Reproduced by permission of The Psychological Corporation.)


Despite its general excellence, the DAT has some practical
drawbacks to its use in schools. For one thing, the battery is
long (working time approximately three hours) and the cost
relatively high. Good norms are available for the high-school 
grades (boys and girls taken separately), but there are relatively 
few data on occupational and vocational groups. The battery 
appears to have content validity, and various experimental studies 
indicate that it possesses empirical validity. For example, workers 
in the electrical, mechanical, and building trades score above
average on mechanical reasoning, and clerks are about average
in numerical ability and in clerical speed and accuracy and language
usage. Engineering students score very high on all the sub-tests
except the clerical tests, but are above the mean here. Men
in the skilled trades (baker, butcher) are average in mechanical 
reasoning, and low in numerical ability, abstract reasoning, and 
space relations. Pre-medical students score high on all sub-tests, 
and especially high in verbal reasoning, numerical ability, and 
sentences. In the high school, verbal reasoning and sentences are 
predictive of grades in English; numerical ability, verbal reason- 
ing, and abstract reasoning show substantial correlations with 
mathematics and science. Unfortunately, the data do not reveal 
how successful a man is likely to be over a period of time in 
a profession, occupation, or trade. But the tests often provide 
significant clues. 


Minnesota Paper Form Board (MPFB)* 


Description. This is a well-known paper-and-pencil test dealing 
with spatial relations. It represents an effort to put a formboard 
on paper. Sample items are shown in Figure 6-5. 

Each test item presents a geometrical figure cut into two or
more parts. The examinee is to decide how the parts would look if 
fitted into a complete figure; he does this by selecting the draw- 
ing which shows the correct arrangement. Studies have shown 
the Minnesota Paper Form Board to be a good index of ability 
to perceive spatial relations and to manipulate figures in two 
dimensions. The test is useful as an aid in predicting success in 
shop work, grades in technical courses, in dentistry, art work, 
and shop and factory output. It does not tap the more intellectual 
aspects of engineering—for instance, the ability to use symbols 
in solving problems. But it does test one component in engineering
skill. A boy scoring high in the MPFB is not necessarily apt
in engineering, dentistry, or art, but he has promise and is worth
further examination. On the other hand, a boy who scores low
had best be encouraged to try some other kind of work. As
often happens, we can give negative educational and vocational
advice with far greater assurance than we can offer a positive
course of action. Thus, we can tell a youngster that he had
better not attempt engineering, but we cannot always offer him
specific advice as to just what he should do.


* Published by The Psychological Corporation, New York, N. Y.


FIGURE 6-5 Sample Items from the Minnesota Paper Form
Board Test

Directions: For each item choose the figure which would result if the
pieces in the first section were assembled.
Reproduced by permission of The Psychological Corporation.


Scope. Grade 7 and above. 

Scoring. Norms are available for school grades and for various
occupational groups. There are two forms of the MPFB, and
the test is easy to administer and score. Counselors have found
the test useful as a supplement to verbal intelligence and achievement
tests, especially for students planning to study architecture,
engineering, commercial art, and other vocations requiring spatial
perception and visualization.

Murphy-Durrell Diagnostic Reading Readiness Test* 

Description. This test has been designed to measure three char- 
acteristics believed to be important in the acquisition of reading 
skills: auditory discrimination, visual discrimination, and learning
rate. Like other readiness tests it is prognostic, in that it
forecasts whether or not a child is ready to begin reading. It is
also an achievement test and could be so classified, as it measures
the educational maturity of a youngster. The Metropolitan Readiness
Tests (page 118) may, in turn, be classified as aptitude tests
rather than as achievement tests.


* Published by the World Book Company, Yonkers, N. Y.

The Murphy-Durrell test provides useful information for the
first-grade teacher in deciding when to start a formal reading 
program and what outcomes to expect. At the same time, a good 
intelligence test will be useful in estimating general mental 
maturity. 


Scope. Early in the First Grade or before. 


Scoring. Test 1 and Test 2 (auditory discrimination and visual
discrimination) require about an hour each. Test 3 (learning) is
both an individual and a group test; there are twenty minutes for
group instruction and three brief individual periods. Obtained
raw scores are converted into percentile norms for Tests 1 and 2;
ratings are used in Test 3.


Orleans Algebra Prognosis Test (Rev.)* 


Description. This is a prognostic test, the purpose of which is to
determine whether a pupil is likely to succeed in (is ready for)
algebra. The test is administered before the pupil undertakes the
study of algebra. There are nine parts, consisting of simple
lessons covering some aspect of algebra (for example, use of
symbols, substitution in equations, literal nomenclature, and
solving of problems), followed by tests on the material presented.
An arithmetic test and a summary test of the material are included.
The test has been shown to have good prognostic value,
as indicated by its correlations with algebra grades and achieve- 
ment test scores in algebra. 


Scope. For students planning to study algebra. 
* Published by the World Book Company, Yonkers, N. Y. 




Scoring. The test requires about forty-five to fifty minutes. 
There are percentile norms corresponding to point scores. Fur- 
ther, there are expectancy charts predicting how well a child 
making a certain score can be expected to do in algebra. The 
reliability of the test is satisfactory. 


Turse Shorthand Aptitude Test* 


Description. This test is illustrative of aptitude tests developed 
for use with commercial and vocational subjects. The purpose of 
the test is to determine whether an examinee is likely to be
successful in learning shorthand. There are seven sub-tests: stroking,
spelling, phonetic association, symbol transcription, word
discrimination, dictation, and word sense. 


Scope. For students planning to study shorthand. 


Scoring. There are percentile norms for students beginning the 
study of shorthand. The Turse test is correlated with achieve- 
ment in shorthand and is valuable in prognosis. The working 
time of the test is about an hour. 


APTITUDE TESTS FOR THE PROFESSIONS 


Tests of aptitude for the professions are primarily achievement 
tests designed to forecast a student’s chances of success in train- 
ing for medicine, law, or engineering. These tests are specialized 
in content and are essentially work samples in the designated 
field. Professional aptitude batteries are validated against grades 
in courses. It is not known precisely just how predictive these
tests are of success in the actual practice of a profession, but 
there is some evidence that such aptitude tests are related—some- 
times highly related—to later success. 

The classroom teacher should be familiar with the general 
content and purpose of the professional aptitude tests, though
he will rarely be called on to administer or score them. These 


* Published by the World Book Company, Yonkers, N. Y. 




batteries are not generally available, are often part of a testing 
program, are highly specialized, and are usually scored and 
interpreted in a testing center. We shall, accordingly, give a less 
detailed description of them. 


Medical College Admission Test* 


Description. This test consists of four parts: verbal, quantitative, 
understanding modern society, and science. The verbal section 
includes tests of vocabulary, and reading comprehension tests 
in science, social studies, and the humanities. The quantitative
part requires that the examinee solve problems making use of
numbers and symbols. The “understanding society” section is a
multiple-choice examination covering current social, economic, 
and political affairs. The science part of the test contains ques- 
tions drawn from pre-medical courses in biology, chemistry, and 
physics. Samples from the various sections reveal the character 
of the examination. 


Verbal section 
sporadic: (A) immediate, (B) regular, (C) occasional, (D) alternate,
(E) replete


Quantitative part
12. One-fifth of a batch of 2000 radio tubes were defective. If one- 
fourth of the first 1000 were defective, what fraction of the 
second 1000 were defective? 
(A) 1/20 (B) 1/10 (C) 3/20 (D) 9/40 (E) 3/10 


Understanding society 

18. Which of the following was the primary objective of the 
nations which signed the North Atlantic Pact? 
(A) To form an alliance for military conquest. 
(B) To insure economic stability in democratic states. 
(C) To replace the Marshall Plan with a new alliance.
(D) To destroy the effectiveness of the Soviet veto in the
United Nations.
(E) To unite for collective defense.


* Published by the Educational Testing Service, Princeton, N. J. 




Science 
21. A sodium atom and a sodium ion 
(A) contain the same number of electrons 
(B) contain the same number of protons 
(C) have the same chemical properties 
(D) have the same physical properties 
(E) have different atomic numbers 
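
The reasoning demanded by item 12 of the quantitative part can be traced step by step. The sketch below is offered only as a worked check of the arithmetic, not as part of the published test; the variable names are invented for the illustration, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

batch_size = 2000   # total number of tubes in the batch
half_size = 1000    # size of the first (and of the second) thousand

# One-fifth of the whole batch was defective.
defective_total = Fraction(1, 5) * batch_size

# One-fourth of the first thousand was defective.
defective_first = Fraction(1, 4) * half_size

# Whatever remains must have come from the second thousand.
defective_second = defective_total - defective_first

# Express the remainder as a fraction of the second thousand.
fraction_second = defective_second / half_size

print("Defective in whole batch:", defective_total)
print("Defective in first 1000:", defective_first)
print("Defective in second 1000:", defective_second)
print("Fraction of second 1000 defective:", fraction_second)
```

The check gives 400 defective tubes in all, 250 of them in the first thousand, and therefore 150 in the second thousand, or 3/20, which is choice (C).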




The first three parts of the battery are related directly to 
standing in medical school. The “understanding society” section 
is not related to medical knowledge, but is included in an 
attempt to select candidates for medicine who will be successful
in adapting to the needs of the time. In their instructions to candidates,
the authors write that “the test is intended to complement
other data (your total college record, interviews, references and
recommendations) with an objective inventory of your skills,
concepts and information . . . acquired from formal study and
from experience.”*


Law School Admission Test**


Description. This battery is designed for use in selecting the 
best candidates from among those applying for law school. The
battery has six parts: principles and cases, data interpretation,
reading comprehension, debates (the examinee determines
whether a statement supports, refutes, or is irrelevant to a given
resolution), best arguments, and paragraph reading. Some of the
material is difficult. The test battery has a correlation of about .50
with law school grades. When combined with college marks,
it is highly predictive of success in law school.


Pre-Engineering Ability Test** 


Description. This test consists of two sorts of material: (a)
comprehension of scientific materials, and (b) general mathematical
problems designed to measure competence in this area.
The first part of the test involves reading scientific prose, tables
and graphs, and answering questions based upon these materials.
The second part consists of problems in arithmetic, algebra, and
geometry. The Pre-Engineering Test correlates about .50 with
grades in the first term of engineering school. The reliability
of the battery is high.


* Medical College Admission Test, Bulletin of Information, Educational
Testing Service, Princeton, N. J., 1957, p. 22.

** Published by the Educational Testing Service, Princeton, N. J.


National Teacher Examination 


Description. These examinations are planned for use by school 
systems as an aid in the selection of teachers, and they are em- 
ployed also by teacher-training colleges as a means of evaluating 
their students. The examinations are constructed, administered, 
and scored by the Educational Testing Service. Their objective 
is the measurement of professional background, general intelligence,
and general culture. There are two parts of the battery:
four common examinations, and a series of optional examinations.
The first set covers a student’s general background for teaching,
and the second his mastery of some special field.

The common examinations comprise the following sub-tests:

Professional information: Child development, educational psychology,
guidance, measurement, principles and methods of teaching.

General culture: Sections on science and mathematics and on literature,
history and the fine arts. Examinations cover the development and
current state of affairs in these fields.

English expression: Grammatical errors to be detected in sentences.

Non-verbal reasoning: A pattern completion test in which the examinee
must fathom the relationships in a given figure and choose the correct
figure to complete the pattern.

The optional examinations cover eight areas of specialization:
education in elementary schools, early child education, biological
sciences, English, industrial arts, mathematics, physical sciences,
and social studies. The four common examinations have exhibited
substantial relationships with ratings for effectiveness of teaching
by supervisors. The tests do not attempt to measure personality
factors, interest, or drive.


FIGURE 6-6 Sample Item from the Meier Art Judgment Test: Which of the
two presentations is the more pleasing?


TESTS OF ARTISTIC APTITUDE OR TALENT 


Tests in this area are concerned with finding whether an 
examinee possesses some of the factors which appear to be neces- 
sary for success in music or in art. So many traits contribute to 
the success of an artist or a musician that it is impossible for an 
aptitude test to do more than tap some of the more obvious com- 
ponents. Perhaps the best the aptitude tests can do in many in- 
stances is to aid the counselor in steering away from the arts those 
aspiring students who have no real talent and whose money and 
time might be better spent in other pursuits. 

It is doubtful whether the classroom teacher will have the 
time, the training, or the equipment needed to administer and 
interpret the aptitude tests in this area. Teachers engaged in 
guidance should be familiar with such tests, however—with what 
they are and what they are trying to do. Two tests of music 
and one of art will be described in this section. They are repre- 
sentative of aptitude measures in this field. 


Seashore Measures of Musical Talents 
Diagnostic Tests of Achievement in Music 


Meier Art Judgment Test 


Seashore Measures of Musical Talents*


Description. This is a test of “ear for music.” The test battery
consists of six separate tests covering such attributes of tone as
pitch discrimination, loudness, rhythm, time, timbre, and tonal
memory. The tests are given by means of phonograph records.
Each test item or problem presents a pair of tones or a tonal 
sequence. In the second playing, one of the tones is changed, 
or the sequence of tones is altered in some way. In the pitch 
discrimination test, the examinee marks on a test sheet whether 
the second tone is higher (H) or lower (L) than the first. Com- 
parisons become progressively more difficult as the difference 
in pitch between the two tones decreases. In the time and loudness
tests, the second, or comparison, tone differs in strength
or in pattern from the first. The rhythm test requires the examinee
to decide whether the second of two patterns is alike or
different from the first. The timbre and tonal memory tests
differ somewhat from the others. In the first, two tonal patterns
are compared for quality (consonance); in the second, a short
series of from three to five tones is played, and then played a
second time with one note changed. The subject must write
down the number of the altered note. The stimuli (tones) presented
by the phonograph records are as pure (uncomplicated)
as possible.


* Published by The Psychological Corporation, New York, N. Y.


Scope. The Seashore Tests are applicable from the fifth grade 
on. 


Scoring. Scores from the six sub-tests are plotted on a profile 
to give a graphic representation of performance. Percentile norms 
are available for fifth and sixth graders, seventh and eighth grad- 
ers, and adults. The Seashore Tests have been used in schools of 
music and in music courses in academic schools. The tests ad- 
mittedly do not run the gamut of musical talent, but they do 
measure important aspects of musical aptitude. A child who scores
low on these tests has a poor ear for music and is a doubtful
selection for extensive musical training. The reliability of the 
battery runs about .80. 


Diagnostic Tests of Achievement in Music* 


Description. As the name implies, this test battery is designed 
to find how well students have acquired the theory and technical
knowledge needed to read and understand music. The
test consists of ten parts: diatonic syllable names, chromatic
syllable names, number names, time signatures, major and minor
keys, note and rest values, letter names, signs and symbols, key
names and song recognition. Test content is based on material
recommended by musical authorities as fundamental in musical
education. A piano is required for the tests.


* Published by the California Test Bureau, Los Angeles, Calif.


Scope. For grades 4 through 12. Test items are graded up 
sharply in difficulty. 

Scoring. Norms for the test are based on the degree of mastery 
shown by students for the various sorts of material. Strengths and 
weaknesses are revealed by comparison of scores upon the ten 
parts of the test. The reliability of the whole test over the rather 
wide range for which it is applicable is very high. Reliability for
separate grades is lower, but probably satisfactory. Working
time for the test is about sixty minutes. The Diagnostic Tests
are a useful supplement to “ear” tests like the Seashore. An ear
for music is necessary for any musical activity, whereas a knowledge
of the technical aspects of music is necessary for one
aspiring to be a musician.


Meier Art Judgment Test* 


Description. This test consists of a hundred problems in each of 
which an artistic judgment is demanded. Each test item is pre- 
sented in two versions. In the first version, there is a painting or 
drawing by some well-known artist, or an acknowledged artistic 
design; in the second, the same theme is presented but in altered 
form, the change being in symmetry, balance, unity, or rhythm. 
All pictures are in black and white, so that no complication is 
introduced (nor any clues) by color. The examinee is told that 
the two versions of the picture differ and is asked to select the 
better version. The test is, accordingly, a measure of aesthetic 
judgment, the criterion being the consensus of experts in art. 
See Figure 6-6 (facing page 150). 

Scope. The Meier test is intended for junior and senior high
schools, as well as colleges and art schools.


* Published by the Bureau of Educational Research and Service, University of
Iowa, Iowa City, Iowa.




Scoring. Norms are for students in art courses. High scores do 
not necessarily mean that the student is destined to be an artist. 
But a low score should be a warning signal to one planning a 
career in art. The Meier test has correlations of from about .45 
to .50 with grades in art courses, but low correlation with scores 
on verbal intelligence tests. This does not mean that artists are 
unintelligent, but that many factors besides abstract ability must 
enter into artistic appreciation. The reliability of the test is
about .75 for fairly homogeneous groups. 


HOW TO JUDGE AN APTITUDE TEST 


Like tests of intelligence and of educational achievement, apti- 
tude tests must be judged by the adequacy of their validation, 
reliability, scaling, and norms. Various comments concerning 
these aspects of the tests described in this chapter have been made
in appropriate places. This and other material will now be
summarized.


Validity. Aptitude tests generally possess content validity. Tests
of speed, dexterity, seeing mechanical relations, solving mechanical
problems, and the like seem proper for measuring mechanical
aptitude. Moreover, tests of sorting, writing, reading, and alphabetizing
appear to be appropriate for assessing clerical aptitude.
In the tests of professional aptitude and in those of talent, the
content has been chosen with a view toward forecasting performance
in school and (hopefully) in life.

Aptitude tests have been validated against various criteria, including
grades in courses and success in vocations or trades.
Measures of practical or working validity have been reported
for the Minnesota Paper Form Board, the MacQuarrie and the
DAT.


Reliability. The reliability of the standard aptitude tests is
generally satisfactory, and we can have confidence in the stability
of a score. In some cases, the standard error of a score, as well
as the reliability coefficient, is given by the author of a test.
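
The connection between a reliability coefficient and the standard error of a score can be made concrete with a small computation. The sketch below uses the usual formula, standard error = SD × sqrt(1 − r), and applies it to reliabilities of the size quoted in this chapter (.90, .80, and .75). The standard deviation of 10 points is an assumption made only for the illustration; it is not a figure taken from any test manual.

```python
import math


def standard_error_of_score(sd, reliability):
    """Standard error of an obtained score: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


SD = 10.0  # assumed standard deviation of the score scale (illustrative)

# Reliabilities of the size reported in this chapter.
for r in (0.90, 0.80, 0.75):
    sem = standard_error_of_score(SD, r)
    print(f"reliability {r:.2f}: standard error about {sem:.1f} points")
```

On this assumed scale a reliability of .90 leaves a standard error of roughly three points, whereas a reliability of .75 leaves one of five points, which is why judgments based on the less reliable tests must be more tentative.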




Scaling. In most aptitude tests, raw scores are converted into 
percentile ranks. In some tests (the DAT, for example) scaled 
scores are used on the profile in making comparisons of a given 
student’s scores. 
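
The conversion of a raw score into a percentile rank can be sketched as follows. The norm group below is invented for the illustration and does not come from any published test. The sketch follows one common convention, in which half of the cases tied at the given score are counted as falling below it; the tables of a particular publisher may follow a somewhat different rule.

```python
def percentile_rank(raw_score, norm_scores):
    """Per cent of the norm group below the score, counting half of any ties."""
    below = sum(1 for s in norm_scores if s < raw_score)
    tied = sum(1 for s in norm_scores if s == raw_score)
    return 100.0 * (below + 0.5 * tied) / len(norm_scores)


# Hypothetical raw scores of a ten-pupil norm group (illustrative only).
norm_group = [12, 15, 15, 18, 20, 21, 21, 21, 24, 27]

for score in (12, 21, 27):
    rank = percentile_rank(score, norm_group)
    print(f"raw score {score}: percentile rank {rank:.0f}")
```

A real set of norms rests on hundreds or thousands of cases rather than ten, but the arithmetic is the same.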

Norms. All aptitude tests have norms either in percentiles or in 
scaled scores. A few tests give norms for certain occupational 
groups. One drawback to the use of vocational aptitude tests, 
however, is the lack of adequate norms in many job areas. There 
is a need for information regarding the predictive value of pro- 
fessional and vocational tests for persons long out of school. 
It would be a great advance if we knew how well an aptitude 
test could forecast the success of engineers or lawyers, for example,
and not simply grades in courses.


SUGGESTIONS FOR READINGS


Anastasi, A. Psychological Testing. New York: Macmillan, 1954.

Cronbach, L. J. Essentials of Psychological Testing. New York:
Harper, 1949.

Greene, E. B. Measurements of Human Behavior (Rev. edition). New
York: Odyssey Press, 1952.

Noll, V. H. Introduction to Educational Measurement. Boston:
Houghton Mifflin, 1957.


SUGGESTIONS FOR LABORATORY WORK


1. Administer several standardized aptitude tests to the class. Cut the
time allowance if necessary. Students should score their own tests and
plot profiles when called for.

2. Find in the Manual the specifications which the author lays down
for his aptitude test. Examine the items of the test. Do you agree that
the test has content validity? Are any data given on experimental
validity?

3. Make a study of the Differential Aptitude Tests. Analyze the battery
for validity, reliability, scaling procedures, and norms.


QUESTIONS FOR DISCUSSION


1. How do aptitude tests differ from readiness tests in purpose and
content?


2. Why does a test battery like the Minnesota Clerical Test vary
greatly in the accuracy of its forecasts of office work?

3. Why are aptitude tests used more often in high schools than in
elementary schools?

4. Give some reasons why paper-and-pencil tests of mechanical
ability are as useful as are work samples in determining aptitude.

5. How would you set up a program for selecting candidates for a
nursing school? Outline the procedures you would use.

6. Would you use the Bennett Mechanical Comprehension Test to
select workers in an automobile factory?

7. It has been said that the best measure of aptitude for mathematics
(or for any subject) is the achievement to date. Do you agree?

8. How could you discover whether the Meier Art Test is measuring
native artistic ability and not training in art?

9. A girl of 16 scores very high on the Seashore Music Test. Would
you advise her to undertake a career in music? Why or why not? What
else might you need to know about her?

10. How can a follow-up study of graduates of law and medical schools 
be useful to a counselor using a professional aptitude test? 


CHAPTER 7 


PERSONALITY TESTS 


In previous chapters, we have indicated on several occasions 
that prediction of success based upon measures of intelligence, 
school achievement and aptitude must always be qualified by the 
statement “provided the personality traits are favorable.” In the 
present chapter, we shall attempt to see how well we can deter- 
mine favorable and unfavorable personality traits. 

There are a number of descriptions of personality, and the 
usefulness of any definition will depend in most cases upon the 
purposes of the author. For the teacher or school counselor, a 
practical working definition is to the effect that personality is a
student’s characteristic way of doing things. Suppose that two
boys, John and Jim, are about the same age, have about the same
IQ, and do about the same caliber of school work. But suppose
further that John is friendly, highly motivated, and likeable; that 
Jim is sullen, indifferent in attitude, and generally avoided by 
teachers and classmates. The decided contrast in the behavior of 
these two boys arises from their distinctive personality traits, 
not from their differences in mental ability. Failure in school— 
or in life—is often the result of a person’s inability to make good 
use of his potential personal assets. Obviously, lack of success
may arise either from negative (unpleasant) personality traits or 
from failure to make use of positive (pleasant) personality traits. 

The psychologist has attempted to evaluate personality traits 
in three ways: (1) by rating scales, (2) by questionnaires and in- 
ventories, and (3) by what are called “projective tests.” The first 
two approaches are the more readily applicable in the schools. 
Projective tests should be employed only by clinical psychologists
and psychiatrists, since they require special training for
their administration and interpretation. These “tests” are essen- 
tially clinical instruments and are most often used diagnostically in 
cases of severe personality disorders or in behavior problems where 
drastic emotional disturbances are suspected. Rating scales and 
inventories, on the other hand, can be administered and inter- 
preted in a useful way by teachers and counselors. 


RATING SCALES 


The rating scale is a device for obtaining judgments of the
degree to which an individual possesses certain behavior traits and
attributes not readily detectable by objective tests. In the school
situation, rating scales provide appraisals of a teacher (or of a
candidate for a teaching position) in several characteristics. Ratings
by teachers or principals are often required for students
seeking entrance to college or looking for a job. In making a
rating, the judge expresses his opinion by marking along a graduated
scale or by checking in the category which he feels best
describes the person being rated. 


FIGURE 7-1 Sample Items from Various Graphic Rating Scales


1. From a graphic rating scale for clerical workers:
Accuracy: Consider carefully quality of work, freedom from error.

no errors | very careful | few errors | careless | many errors


2. From a behavior rating scale for children:
Is his attention sustained?

(5) Distracted: jumps rapidly from one thing to another.
(4) Difficult to keep at a task until completed.
(3) Attends adequately.
(2) Is absorbed in what he does.
(1) Able to hold attention for long periods.


3. From the American Council on Education Rating Scale for
prospective college students:
Does he get others to do what he wishes?

Probably unable to lead his fellows. | Lets others take lead. |
Sometimes leads in minor affairs. | Sometimes leads in important
affairs. | Displays marked ability to lead his fellows; makes things go.


4. From a rating device for teacher candidates:
(Put a check under the appropriate heading):

Tact: Very inferior | Inferior | Average | Superior | Very superior


5. From a rating scale for teachers:
(Circle the number which best indicates the degree or extent to
which the qualities are practiced.)

0=unsatisfactory; 1=below average; 2=average;
3=above average; 4=superior.

Emotional maturity:

To what extent does the teacher exhibit desirable          0 1 2 3 4
balance between emotional responsiveness and emotional
control? Consider disposition, sense of humor, restraint
and thoughtfulness in dealing with others, feelings of
security, objectivity of interest, freedom from excessive
fears and worries and warmth of feeling and expression.


6. From a rating scale for officer candidates:
Relations with fellow candidates:

Uncooperative | Grudgingly cooperative | Cooperates |
Cooperates willingly | Leads and cooperates; contributes good ideas.


Perhaps the most useful type of rating device is the Graphic 
Rating Scale or some variation of it. The typical graphic scale 
consists of a straight line, for example, five inches long, which is 
taken to represent the range of behavior in the trait. In lieu of a
line, several categories representing gradations in the trait may
be provided. The illustrations in Figure 7-1 are samples from
various rating scales.

Units on the graphic rating scale are represented by the suc- 
cessive scale divisions, but the directions make it clear that the 
check mark indicating the judgment may be placed anywhere 
along the scale line. A graphic scale is often scored by separating 
the scale line into one hundred points. A person’s rating is then
determined by the distance of the judge’s check from the low 
end of the scale. A more summary method is also used: if there 
are five main divisions on the scale, the highest division may be 
designated “1,” the next division “2,” and so on down. 
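The two scoring methods just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the five-inch line and the five divisions follow the example in the text, and the check mark is taken to be measured in inches from the low end of the line.

```python
def graphic_scale_score(check_position, line_length=5.0, points=100):
    """Convert the distance of the check from the low end into scale points.

    check_position and line_length are in the same unit (inches here).
    """
    if not 0 <= check_position <= line_length:
        raise ValueError("check mark must lie on the scale line")
    return round(points * check_position / line_length)


def summary_division(check_position, line_length=5.0, divisions=5):
    """Return the division holding the check, counting the highest as 1."""
    if not 0 <= check_position <= line_length:
        raise ValueError("check mark must lie on the scale line")
    width = line_length / divisions
    # Index of the division counted from the low end (0 = lowest);
    # a check exactly at the high end belongs to the top division.
    from_low = min(int(check_position / width), divisions - 1)
    return divisions - from_low


if __name__ == "__main__":
    # A check placed 3.2 inches from the low end of a five-inch line.
    print(graphic_scale_score(3.2), summary_division(3.2))
```

A check 3.2 inches along a five-inch line thus falls at 64 points on the hundred-point method and in division 2 (the second highest) on the summary method.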


Requirements of a Good Rating Scale 


A good rating scale should satisfy the following requirements.

Traits should be carefully defined. A sentence or a phrase is
more informative than a single word. Thus, the meaning of a trait
such as “initiative” or “dependability” should be pinned down by
descriptive phrases or by actual examples. Intelligence, energy
level, personal appearance, work habits, and the like are better
judged than are character traits (loyalty, courage, and so on)
because they are more readily observed in social behavior. Char-
acter traits must be inferred from a variety of behaviors. It often
helps to clarify a rating if the judge is required to record specific
instances (“behaviorgrams”) of the trait which support his
opinion. On the ACE scale, for example, space is left in which
the judge may provide observations which justify his rating.

A good scale avoids terms which are hodgepodges of a variety
of activities, for example, “standing in the community,” “social
position,” or “moral qualities.” By the same token, a good scale
avoids narrow, specific terms. The dean or principal rarely knows
intimate details about a teacher—whether he sings in the choir, 
loves his mother, or plays golf well. Information of this sort is 
often called for, however. Judges do not often have occasion to 
observe or to learn about personal behavior and, in general, 
should not be asked to supply such information. 

The number of divisions on the scale should be neither too 
numerous nor too few. The optimum number of divisions on a 
graphic scale is perhaps five to seven. Fewer than five divisions
cause the groupings to be too coarse; more than seven divisions
demand fractionings of the trait which are too fine for most
raters, with the result that a large part of the scale may be unused. 
A five-division scale is popular, since it corresponds to the mark- 
ing system A, B, C, D, and E. Furthermore, the five categories 
“high,” “above average,” “average,” “below average,” and 
“poor” seem to mark off fairly natural divisions. 

Directions to the rater should be explicit. The adequacy of the 
directions given the rater will have a substantial effect on the 
validity and reliability of his ratings. The rater (1) should be 
given as explicit directions as possible, (2) should be told what 
is meant by the distribution of a trait, and (3) should be warned 
against assigning too many “average” ratings. This last is some- 
times needed when the persons to be rated are not well known 
to the rater, when the meaning of the traits is not well under- 
stood, and when the rater is overcautious. Raters must be warned 
against the “halo effect” and the tendency to see logical relations 
among traits—to assume, for example, that intelligence and moral 
behavior or intelligence and good work habits of necessity go 
together. 

As for the distribution of traits, the best first hypothesis (in lieu 
of other information) is to assume that ratings will be distributed 
in the form of a normal curve. When the baseline of the normal 
curve is subdivided into five equal parts, the percentages in the
divisions (reading from either end of the curve) are 7, 24, 38,
24, and 7. If there are seven divisions on the scale, the percentages
in the subdivisions are 4, 10, 22, 28, 22, 10, and 4. The directions
should make it clear that the exact proportions in the normal
curve should not be followed slavishly. But the point should be 
stressed that when a group of teachers is rated for “skill in instruc- 
tion” on a five-division scale, we must expect many more in the 
middle of the scale than at either end. 
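The percentages quoted for five and seven divisions can be checked with a short computation. The sketch below rests on an assumption the text does not state: that the baseline is taken to run from 2.5 standard deviations below the mean to 2.5 above it, is cut into equal parts, and that the two end divisions absorb the small tails beyond those limits. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def division_percentages(divisions, half_range=2.5):
    """Percentage of cases in each of `divisions` equal baseline parts.

    Assumption: the baseline spans -half_range..+half_range (in SD units)
    and the end divisions take in the tails beyond those limits.
    """
    width = 2.0 * half_range / divisions
    cuts = [-half_range + width * i for i in range(1, divisions)]
    edges = [0.0] + [normal_cdf(c) for c in cuts] + [1.0]
    return [100.0 * (hi - lo) for lo, hi in zip(edges, edges[1:])]


if __name__ == "__main__":
    for k in (5, 7):
        print(k, [round(p, 1) for p in division_percentages(k)])
```

Under this assumption the computed values agree with the figures given in the text to within one percentage point, which suggests (but does not prove) that this is how they were obtained.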

The tendency to assign too many “average” ratings (the “cau-
tion factor”) is to be contrasted with the tendency to assign too
many high ratings (the “generosity factor”). Stress on the
normal distribution of most traits will help correct this tendency.
If the low end of the rating scale is described by such unpleasant
terms as “stingy,” “stupid,” “mean,” this part of the rating line
may be avoided by many raters. The “halo effect” mentioned
above is the tendency to rate a person high on all traits if he is
well liked or is regarded as highly intelligent. Conversely, if a
person is disliked, there is a tendency for him to be rated low on
all traits. To minimize halo, the rater is usually told to rate all
candidates on a single trait, then to rate all on a second trait, and
so on. This procedure is impractical, of course, when the rater
is called on to consider one person at a time and rate him on, say,
ten traits. We are often forced, therefore, to resort to warnings 
against halo and careful definition of traits. 


Validity and Reliability of a Rating Scale


Ratings for intelligence and for special aptitudes made on a
graphic rating scale can be validated against objective test scores.
But ratings of personality traits cannot be so validated, since we
rarely have criterion scores and must perforce fall back on a
consensus of judges. In ratings of personality traits, validity and
reliability mean virtually the same thing. If three or more judges
agree that Brown is a skilled worker or a friendly person, the
average of these ratings is reliable (consistent) and valid (worthy
of confidence). If two supervisors decide independently that
Miss Miller has a pleasing voice and sympathetic manner, our
confidence in these judgments is greater than if only one super-
visor had so stated. In general, confidence increases with the
number of agreements when ratings are made independently. 

It can be shown that if the estimates of two judges correlate 
.60, then the average of these ratings will correlate .75 with those 
of two equally good judges. It is, of course, difficult to decide 
when two judges are “equally good.” We can never guarantee 
this to be true, but we can (a) select judges who at least know 
the ratees well, (b) provide careful definitions of the traits to be 
rated, and (c) allow for individual differences in rating standards. 
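The figure of .75 is what the Spearman-Brown formula gives when two judges whose ratings correlate .60 are pooled. The text does not name the formula, so treating it as the basis of the statement is an assumption; the function name below is hypothetical.

```python
def pooled_reliability(r, k=2):
    """Spearman-Brown estimate for the average of k equally good judges.

    r is the correlation between the ratings of any two single judges.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r / (1.0 + (k - 1) * r)


if __name__ == "__main__":
    # Two judges whose estimates correlate .60, as in the text.
    print(round(pooled_reliability(0.60, k=2), 2))
```

The same function shows how confidence grows as more independent judges are added: with three judges at the same level of agreement the estimate rises to about .82.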


Summary on Rating Scales


Ratings from graphic scales will generally deserve confidence 
when: 

1. Qualities which can be observed in behavior are rated.
Energy, appearance, and teaching skill are better rated than
are character and moral traits.

2. Characteristics to be rated are illustrated. The use of behavior-
grams (page 160) and instances will strengthen the ratings.

3. Raters have actually observed the persons to be rated in 
situations where personality might be revealed. 

4. Independent ratings are pooled. 

5. Judges are confident that the ratings are valuable. 

6. Different standards are accounted for by explicit directions 
or by statistical techniques. 

The above rules are perhaps most useful when one has the 
problem of constructing a rating device; and they may not seem 
to be very helpful to the teacher who is faced by a ready-made 
scale. Teachers and supervisors rarely have the responsibility for 
devising a rating scale. But teachers are rated by supervisors and 
supervisors by principals. Moreover, students are rated by
teachers and by principals for personality traits judged important
by colleges or prospective employers. Hence, the teacher should
be familiar with how the rating scale is put together and how
it works. Raters can improve a rating device by offering crit-
icisms and commendations. Eventually comments of this sort
should lead to a better scale.


QUESTIONNAIRES AND PERSONALITY INVENTORIES


The behavior inventory calls for short answers to a standard 
set of questions believed to bring out personality characteristics. 
The inventory or questionnaire is a formal interview or self- 
report rather than a “mental test”; the examinee is not required 
to solve problems, but is asked to express opinions, preferences, 
and feelings. Questionnaires have been developed by psychol- 
ogists for use in three main areas: (a) personal-social behavior, 
(b) attitudes, and (c) interests. The personal data sheet or per- 
sonality inventory deals with motives and needs, as well as with 
emotional and social factors. Adjustment to life—or more accu- 
rately, perhaps, maladjustment—is revealed by a person’s self- 
report of his worries, fears, feelings of insecurity or of depression, 
frustrations, lack of confidence and the like. The typical attitude 
inventory canvasses the examinee’s feelings, opinions, and beliefs 
about various institutions (for example, the church) and about 
social and political matters (for example, war, freedom of speech 
and internationalism). Finally, the interest questionnaire deals 
with preferences for occupations, people, school subjects (such 
as physics or history), books, sports, hobbies, and avocations. 

A personality inventory may take the direct or the indirect 
approach. In the first, the examinee is asked for specific informa-
tion; in the second, he does not know (though he may guess or
suspect) the import of the questions. For example, the examinee
may be asked if he is afraid of high places (direct) or he may be
asked (among other questions) whether he would rather be a
bookkeeper or an airline pilot (indirect). In the indirect form
of the question, the assumption made is that an examinee is less
likely to fake or rationalize his answers when he is not sure what
motive or what personality trait the inventory is trying to un-
cover.


The Personality Inventory 


Personality inventories were used by the armed forces in 
World Wars I and II to screen out the maladjusted and those 
likely to become mentally ill. These personal data (or PD) 
sheets consisted of lists of symptoms reported by men who sub- 
sequently had suffered from “nervous breakdown” or were classi- 
fied as psychoneurotic. Adult questionnaires have been revised 
by deleting the more serious and disturbing items so that they 
could be used in the schools. Questions which were removed 
deal with the more reprehensible forms of adult behavior such 
as those involving liquor and sex offenses. In the schools, the 
questionnaire is used to locate pupils with potentially handi-
capping personality problems. The acceptability of a PD Sheet
to pupils, parents, and the community is necessary if the inven-
tory is to be used generally as a group test. A teacher will be well
advised to make sure that the inventory he proposes to use has
the approval of the school authorities. It is important that the
reading level demanded by an inventory be carefully scrutinized, 
since many items may not be understood. 

The personality inventory is most valuable in the schools for 
counseling and guidance, that is, for spotting pupils with exist-
ing or potential personality difficulties. When used individually
and in face-to-face contacts, the PD Sheet is more flexible and
becomes essentially a directed interview. Answers given by the
student can be pursued further until their meaning is clear. This
cannot be done, of course, when the inventory is administered
in group form. Of the personality inventories available (many
cover the same ground), the following represent acceptable
“tests” for use in the schools:

California Test of Personality
Pintner’s Aspects of Personality
Gordon’s Personal Profile and Personal Inventory
Bell’s Adjustment Inventory
Thurstone Temperament Schedule

Each of these questionnaires will be considered in this section.


California Test of Personality* 


Description. This test series runs the gamut from the elementary 
grades to adulthood. Each battery is divided into two sections
designed to measure (1) personal adjustment and (2) social ad-
justment. The six sub-tests in section 1 are designed to bring out
how a student thinks and feels about himself, his feelings of con-
fidence and adequacy, his tendencies to withdraw within himself
and to exhibit nervous symptoms. In section 2, the six sub-tests
question the examinee on his knowledge of social standards, his
social skills, his freedom from anti-social attitudes, and his rela-
tions to family, friends, and the community. The questions are
Yes-No in form. Figure 7-2 gives samples from the test. 


Scope. There are five separate test batteries:

Primary Series, kindergarten to grade 3
Elementary Series, grades 4-8
Intermediate Series, grades 7-10
Secondary Series, grades 9-college
Adult Series

Over-all time for administering a test battery is approximately 50
minutes.


* Published by the California Test Bureau, Los Angeles, Calif.


FIGURE 7-2 Sample Items from the California Test of Per-
sonality, Elementary, Grades 4-5-6-7-8, Form AA


PERSONAL ADJUSTMENT (Circle YES or NO)

10. Do your parents or teachers usually need to tell you to
do your work? YES NO
23. Do people often think that you cannot do things well? YES NO
25. Do you feel that your folks boss you too much? YES NO
38. Are you proud of your school? YES NO
50. Would you rather stay away from most parties? YES NO
68. Do you often feel tired before noon? YES NO

SOCIAL ADJUSTMENT

77. Is it necessary to thank those who have helped you? YES NO
87. Do you help new pupils to talk to other children? YES NO
101. Do people often act so mean that you have to be
nasty to them? YES NO
114. Do you like both of your parents about the same? YES NO
123. Is it fun to do nice things for some of the other
boys or girls? YES NO
139. Do you try to get friends to obey the law? YES NO


Reproduced by permission of the California Test Bureau.


Scoring. Answers can be recorded in the test booklet itself or 
on a prepared answer sheet. Scoring is objective and easy. A
profile of the different scores and over-all adjustment score can
be constructed. The pupil’s earned score (point score) is entered
opposite the personality component, and the percentile rank cor-
responding to this score is found in the appropriate table. Per-
centile ranks for total personal adjustment and for total social 
adjustment may also be entered. 
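The step from point score to percentile rank is a simple table lookup, which may be sketched as follows. The norm table below is invented purely for illustration and is not taken from the test manual; the function name is likewise hypothetical. Each entry gives the lowest point score that earns the stated percentile rank.

```python
from bisect import bisect_right

# Hypothetical norm table (NOT from the published manual):
# (lowest point score in the band, percentile rank for that band).
NORMS = [(0, 1), (20, 10), (30, 25), (40, 50), (50, 75), (60, 90), (70, 99)]


def percentile_rank(score, norms=NORMS):
    """Look up the percentile rank for a point score in a norm table."""
    floors = [floor for floor, _ in norms]
    i = bisect_right(floors, score) - 1  # last band whose floor <= score
    if i < 0:
        raise ValueError("score is below the lowest entry in the table")
    return norms[i][1]


if __name__ == "__main__":
    # Build a small profile from made-up point scores.
    point_scores = {"personal adjustment": 45, "social adjustment": 62}
    profile = {name: percentile_rank(s) for name, s in point_scores.items()}
    print(profile)
```

In practice a separate table would be needed for each component and each battery; a single table is used here only to keep the sketch short.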

The reliability of the five batteries is quite high (.80-.94). 
Percentile norms are provided for each sub-test and for the 
battery as a whole. This inventory is a useful indication of a 
pupil’s all-around adjustment. Diagnosis from the sub-tests is sug- 
gestive rather than conclusive; but many valuable clues which 
serve to explain a child’s behavior may be obtained.


Aspects of Personality (Pintner)* 

Description. This inventory consists of three parts: ascendance- 
submission, extroversion-introversion, and emotionality. It was 
designed to aid the classroom teacher in locating children who 
have developed—or are likely to develop—serious behavior prob- 
lems. Samples of the kind of items found in the test are as follows: 
I have a lot of nerve                          Same-Different
I like to read before the class                Same-Different
I feel tired most of the time                  Same-Different
When a child tries to push into line ahead of me, I
am not afraid to tell him to get back          Same-Different

The pupil indicates his agreement or disagreement with a state-
ment by marking or circling “same” or “different.”


* Published by the World Book Company, Yonkers, N.Y. 


Scope. Aspects of Personality is intended for use in the elemen- 
tary grades and junior high school. 


Scoring. This inventory is readily scored by means of a key.
There are separate norms for boys and girls, and for two levels
of maturity. A low score on the ascendance-submission part in-
dicates a shy child, a high score an aggressive one. High scores in
extroversion-introversion suggest good adjustment; low scores
suggest withdrawing tendencies and daydreaming. Low scores
on emotional stability suggest flightiness and lack of control. The
total score is a rough index of personal adjustment, and probably
only wide deviations should be investigated. The test often pro-
vides useful clues.


Personal Profile and Personal Inventory (Gordon)* 


Description. The Personal Profile is designed to measure four 
fairly distinct personality traits: (a) ascendancy, (b) responsi- 
bility (perseverance or reliability), (c) emotional stability, and 
(d) sociability. The examinee is asked to indicate which of four 
statements (there are eighteen sets) is most descriptive of himself 
and which is least descriptive. A specimen set is 

Able to make important decisions without help 
Does not mix easily with new people 
Inclined to be tense or high strung 


Sees a job through despite difficulties 

Each of these phrases is descriptive, positively or negatively,
of one of the four traits included in the inventory.

This personality questionnaire uses what has been called the
“forced-choice” technique; that is, the examinee is instructed to
choose between statements two of which appear to be equally
acceptable and two equally unacceptable. (See the description of
the indirect questionnaire on page 164.) This method of pre-
senting items has certain advantages. If the two choices are fairly
well equated for social value, it is hard for the examinee to fake
his answer, since he does not know clearly what is behind either
choice. Again, the use of forced-choices reduces hesitation and
indecision, since the examinee is required to make a decision,
rather than simply choosing between Yes and No. If the examinee
likes none of the choices, he may select the least objectionable.
In addition to the four trait scores, the Personal Profile yields a
total score, which may be represented graphically along with
the four part scores. Very low total scores have been found to
be associated with maladjustment and poorly developed per-
sonality.

* Published by the World Book Company, Yonkers, N. Y.

The Personal Inventory also covers four traits: caution, original 
thinking, personal relations, and vigor. The total score depicts 
the student’s personal development in these areas. 


Scope. The Profile and Inventory are designed for high schools, 
colleges, and adults. 


Scoring. Percentile norms are available for each scale, for boys 
and girls separately, for high school, and for college. The four 
scores and the total may be represented graphically on a profile.
These questionnaires have considerable validity, as is shown in
follow-up studies. Together, the two inventories are useful in
counseling students and in screening out those with potential
behavior problems. Reliabilities of the sub-tests and of the total
are satisfactory.


Adjustment Inventory (Bell)* 

Description. This well-known inventory consists of questions 
to be answered Yes, No, or ?. It has been designed to estimate 
personal adjustment in four areas: home (satisfactions and dis-
satisfactions), health (illness and general well-being), social
relations (shyness, aggressiveness, and so on), and emotional be-
havior (self-confidence, depression, and the like). Samples of
the kinds of items in the questionnaire are 


* Published by the Stanford University Press, Stanford, Calif. 


Are you troubled with shyness?     Yes  No  ?
Do you daydream frequently?        Yes  No  ?
Are you often low in spirits?      Yes  No  ?

The Bell inventory has proved useful chiefly in locating students
who need counseling. It provides valuable leads to social and
personal maladjustment.


Scope. The student form of the Bell is for high school and
college students. There is a form for adults which may be used in
vocational counseling.


Scoring. The inventory is not timed, but ordinarily requires
about twenty-five to thirty minutes. The over-all reliability is
high. There are percentile norms for high-school and college
students and for men and women.


Thurstone Temperament Schedule* 


Description. This inventory consists of a set of 140 questions, 
twenty items being grouped under each of seven aspects of 
temperament or emotional expression. Adjectives describing the 
seven temperamental traits are: active (degree of energy),
vigorous (participation in physical activities, sports), impulsive 
(happy-go-lucky), dominant (aggressive, forthright), stable 
(emotionally), sociable, and reflective (thoughtful, meditative). 
These seven behavior areas, which were identified through a 
study of the intercorrelations of many personality variables, are 
believed to constitute certain basic aspects of social behavior. The 
inventory is well adapted for use with normal people: items 
obviously bearing upon mental disease have been avoided. 


Scope. For high schools, colleges, and adults. 


Scoring. Percentile norms are available for the scales,
and the seven scores may be plotted on a profile for a study of
idiosyncrasies. The reliability of the whole inventory is high.
However, the reliabilities of the sub-sections are not high. Hence,
although the inventory may provide valuable clues to a counselor,
diagnosis based on part scores should be tentative.


* Published by Science Research Associates, Chicago, Ill.


ATTITUDE SCALES 


When we know that a man is a Socialist or a Christian Scientist, 
we feel fairly sure that we can predict his answers to questions
dealing with politics or religion. An attitude is a consistent point 
of view, a way of behaving toward an institution, a social group, 
or toward personal, political, or religious issues or practices. 
Attitudes may be fairly narrow or quite broad, and they may be 
strongly or weakly held. In general, an attitude pivots around 
strong likes or dislikes. A person’s attitude toward drinking, 
professional sports, popular music, or “eggheads,” for example, 
will be exhibited in expressions of opinion which are often 
emotional. 

Scales for measuring the spread and strength of attitudes have 
often been used by social psychologists but are rarely employed 
routinely in the schools. One of the most comprehensive lists of 
attitude scales (about thirty in all) has been constructed by 
L. L. Thurstone and his associates at the University of Chicago. 
These scales estimate the strength of one’s attitude (on either 
the favorable or unfavorable side) toward such diverse matters 
as war, capital punishment, the church, and communism. 

In this section we shall describe two scales both of which 
have been useful in high school and college. These are 

Ascendance-Submission Reaction Study (Allport) 
Study of Values (Allport-Vernon-Lindzey)


Ascendance-Submission Reaction Study* 


Description. This questionnaire attempts to determine whether 
a person characteristically dominates or is dominated in the face- 
to-face contacts of everyday life. The A-S Reaction Study is 
usually classified as a personality inventory, but it can just as
well (perhaps better) be described as an attitude scale, since it
tries to discover an individual’s habitual way of behaving in
everyday social contacts. There are two forms of the test, one
for men and one for women. Each item presents a situation which
might readily be encountered in school, on the street, or in a
store or bus. From two to four possible responses are offered.
The examinee selects that option which most nearly represents
what he would ordinarily do. Choices range from aggressive to
submissive and are weighted in such a way as to differentiate be-
tween the two attitudes. Scoring weights for the separate items
were determined experimentally, and the total score shows the
strength of the examinee’s typical behavior on a dominance-
submissive scale.


* Published by the Houghton Mifflin Co., Boston, Mass.


Scope. The A-S inventory is designed for use in high schools, 
in colleges, and for adults. 


Scoring. Scoring is by stencil, the answers being weighted +
(plus) for dominance and - (minus) for submissiveness. Per-
centile norms are provided for high-school and for college stu-
dents, and for adult men and adult women. The A-S Study is
often useful in educational and vocational guidance. In many
occupations, such as nursing, teaching, library work, and clerical
jobs, a strongly dominant attitude is a liability rather than an
asset. On the other hand, in positions requiring leadership,
dominant behavior and self-confidence are crucial when decisions
are to be made. The A-S Study is especially valuable when com-
bined with aptitude and other tests. The test has satisfactory
reliability.


Study of Values (Allport-Vernon-Lindzey)*

Description. This questionnaire sets out to gauge the strength
of six basic attitudes, described as follows: theoretical (signalized
by dominant interest in the discovery of truth, the rational
approach to life); economic (interests lie in practical affairs);
aesthetic (places greatest value on form and beauty); social (chief
interest in people); political (primary interest in power, influence,
and renown); and religious (committed to mystical values, seeks
to comprehend the universe). The Study assumes that a person’s
philosophy of life is revealed by the strength of his basic
attitudes. See Figure 7-3.


* Published by the Houghton Mifflin Co., Boston, Mass.


Scope. College students and adults. 


Scoring. To score, one simply adds the weights assigned the 
various items. The total score for each of the six Values can be
plotted on a profile to show graphically the relative strengths of 
the individual’s attitudes. Norms are for college students, for 
men and women separately, and for some occupational groups. 
The Values inventory has shown expected differences between 
medical and theological students and characteristic differences 
among other occupational groups. The inventory is useful in 
counseling and in personnel selection. It is also valuable in fore- 
casting the direction of a student's attitudes. 
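The scoring rule described above, adding the weights assigned to the items under each Value and then comparing the six totals, can be sketched as follows. The responses and weights in the example are invented for illustration and are not the published scoring weights; the function names are hypothetical.

```python
VALUES = ("theoretical", "economic", "aesthetic",
          "social", "political", "religious")


def value_profile(responses):
    """Add the weights credited to each Value.

    responses is a list of (value_name, weight) pairs, one pair for each
    credit earned by an answer.
    """
    totals = {value: 0 for value in VALUES}
    for value, weight in responses:
        if value not in totals:
            raise KeyError(f"unknown Value: {value}")
        totals[value] += weight
    return totals


def ranked(profile):
    """List the Values from strongest to weakest total score."""
    return sorted(profile, key=lambda value: profile[value], reverse=True)


if __name__ == "__main__":
    # Made-up credits, not taken from the published test.
    answers = [("theoretical", 3), ("economic", 0), ("aesthetic", 2),
               ("theoretical", 1), ("social", 2)]
    profile = value_profile(answers)
    print(profile)
    print(ranked(profile))
```

The ranked list corresponds to reading the plotted profile from its highest point to its lowest.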


FIGURE 7-3 Specimen Items from a Study of Values 


(Answers are indicated by checking or marking.)


Theoretical v. Economic:
1. The main object of scientific research should be the discovery of truth
rather than its practical applications. (a) Yes; (b) No.
Religious v. Social:
9. Which of these character traits do you consider the more desirable:
(a) high ideals and reverence; (b) unselfishness and sympathy?
Aesthetic v. theoretical v. political v. economic:
10. Which of the following would you prefer to do during part of your
next summer vacation (if your ability and other conditions permit):
a. write and publish an original biological essay or article
b. stay in some secluded part of the country where you
can appreciate fine scenery
c. enter a local tennis or other athletic tournament
d. get experience in some new line of business


From Allport, Vernon, and Lindzey, “Study of Values.” Reproduced by permission of Houghton
Mifflin Co.


INTEREST INVENTORIES 

The interest inventory is essentially a self-report or survey 
covering a person’s own interests, values, preferences, and feel- 
ings over a wide range of activities. The importance of interests 
early became apparent in industry when studies of worker 
efficiency made it clear that job success depends as much on moti-
vation as on aptitude and training. In the school, knowledge of a
student’s dominant interests is of real significance for the coun-
selor or teacher. The printed interest inventory supplies sys-
tematic information about a student’s attitudes, feelings, and
personality trends which otherwise could be revealed only in a 
long interview, if at all. Students often report patterns of interest
which differ widely from their stated educational and vocational
goals. The astute counselor may be able from study of the in-
ventory to suggest occupational areas which the student hitherto
[...] conceal their true feelings, especially if they are
[...] socially acceptable. An interest inventory is
not likely to be faked or responded to adversely. Examinees find
it impersonal, less prying, and often interesting in itself. Hence
their appraisals are usually honest. From an interest inventory,
a counselor gets a clearer idea of a student’s occupational aspira-
tions. He may get, too, valuable clues as to a student’s personality
trends, for example, his desire for security rather than for ad-
venture, for active rather than passive roles, for people rather
than books.

The best-known interest inventories are vocational and were
intended for adults. As a result, they are not very useful below
[...]. This is not a serious disadvantage, however,
since the interests of elementary children are often uncrystallized
and may be superficial, unreliable, and unrealistic. The moving
pictures, TV, and romantic stories invest certain activities (that
of the actor, the game hunter, space adventurer, and detective, 
to cite a few) with artificial glamor. Moreover, a pupil’s informa- 
tion concerning many occupations—time required, aptitudes 
needed, financial returns to be expected—is often meager and 
distorted. Even in high school and college, when choice of voca- 
tion becomes crucial, information on many occurations is not 
available. A number of pamphlets describing the requirements 
for various occupations will be helpful to counselors and students 
(see page 253). 

This section will describe three interest inventories, one suit- 
able for pupils whose reading level is up to sixth-grade standards, 
and two for high-school and college students and for adults. 


Occupational Interest Inventory 
Kuder Preference Record 
Strong Vocational Blank 


Occupational Interest Inventory* 


Description. This inventory provides scores in three interest
areas. First, there are scores in six basic fields of occupational
interest: personal-social (personal contacts, service fields);
natural (outdoor activities, farming); mechanical (machinery
design, building, constructing things); business (activities of the
business world, the “profit motive”); arts (music, literature,
drama); sciences (chemistry, engineering, biology). Second,
certain items are designated verbal, manipulative, or computational,
and scores in these areas provide information as to the direction
of one’s interests. Finally, the attempt is made to gauge the level
of an examinee’s interests, that is, whether his interests identify
him with simple routine aspects of a job or with the more expert
performances and skills.

The six basic fields of occupational interest show considerable
overlap, and their identity as strictly separate compartments of
interest is doubtful. At the same time, the Occupational
Inventory does give the counselor a better notion of the [...]
and range of a person’s dominant interests than could be obtained
from casual conversation, and it furnishes many clues to type and
level of commitment. Vocational or other advice based on
interest scores should always be tentative, however, and subject to
confirmation from other sources. The items of the Inventory
are forced-choice in form. The Manual gives many clues as
to the ways in which results may be interpreted. A few samples
from the intermediate form of the Inventory will show the kind
of question asked. The large letter before a question gives the
interest field, the small letter the interest level, and the symbol the
interest type (whether verbal, manipulative or computational).

* Published by the California Test Bureau, Los Angeles, Calif.


Part I


Directions: Draw a circle around the letter of the activity you
            prefer. For example, if you prefer to drive an ice-cream
            truck and sell ice cream, draw a circle around
            A 1 as shown below:

    Ⓐ 1   Drive an ice-cream truck and sell ice cream.
    F 1   Wrap articles in the shipping department of a se[...]

However, if you prefer the second activity, draw a circle around
F 1. A second item, to be marked according to the same
directions, is

    A 14  Conduct visitors through art galleries and museums
    E 14  Help build automobiles, ships, or airplanes


Part II


Directions: Below you will find three activities under each
            number. You are to choose the one you prefer to do
            of the three in each group. Indicate your choice by
            marking the letter preceding the activity.

    f 11  Design or construct stained glass, metal ornaments or
          plastic figures
    b 11  Make pottery, statues or book ends
    d 11  Carve wood or stone or make metal ornamental figures


Scope. The Intermediate Inventory begins with grade 7 but 
may be used with bright sixth graders. The Advanced Inventory 
is for grade 9 and for adults. 


Scoring. Percentile ranks for the six basic interest fields may
be read from the appropriate tables. Standard scores from
converted raw scores are also provided. Part scores may be
represented graphically by means of profiles for a clearer comparison
of interest-strength. The working time for the test is thirty to
forty minutes. Scoring is simple: the items designated by letters
and symbols are counted.


Kuder Preference Record* 


Description. The Kuder Preference Record (Form B1) is a
widely used vocational interest blank. There are 360 items in all,
arranged in groups of three. In each set of three, the examinee is
asked to indicate which activity he would like most and which he
would like least. Response is made by punching a small hole with
a pin, and the answer is recorded on a specially prepared answer
sheet placed under the test blank. The samples below (given as
examples in the Record) are for illustration:


Directions: You will find a number of activities listed in groups
            of three on the following pages. Read over the three
            activities in each group. Decide which of the three
            activities you like most. Note the letter in front of
            it and punch a hole beneath the 1 beside this letter
            in the column at the right, using the pin with which
            you have been provided. Then decide which
            activity you like least and punch a hole beneath the 3
            beside the corresponding letter in the column at the
            right.


* Published by Science Research Associates, Chicago, Ill. There are 3 forms,
2 vocational and 1 personal.


Example #1
                                      1   3
    P. Visit an art gallery           O P O
    Q. Browse in a library            O Q ●   ← least
    R. Visit a museum        most →   ● R O


(The punch in the hole beneath 1 beside R shows that the
examinee would most like to visit a museum. The punch in the hole
beneath 3 beside Q means that he would least like to browse in
a library.)


Example #2
                                      1   3
    S. Collect autographs             O S O
    T. Collect coins         most →   ● T O
    U. Collect butterflies            O U ●   ← least


[...] choice type in that the examinee [...] among limited options.
But the stipulation ([...] most and least) allows some latitude. The
test provides for ten interest-[...] counselor, who can then [...].
The Manual provides [...] patterns typical of persons successful in
various lines of work. The purpose of the Kuder Record is to reveal
interest trends over several broad areas rather than in fairly
specific occupations. There are, for example, fifty-three
occupations grouped under scientific interests. Items in the Record
were first organized on a logical basis in the light of everyday
experience and common sense. Later, items were analyzed
statistically in order to isolate clusters of items highly
correlated. These clusters were taken to reveal a core of interest.

The Kuder Record relies for its validity primarily on content
analysis and logical relations. The number of choices offered
and their nature sometimes confuse students; and the inability
to find clear-cut preferences may lead to dissatisfaction with the
forced-choice aspect of the test. Below the eighth grade the
reading level is probably too high, and the Record should not
be used. The fact that the scoring plan does not weight sharply
strong vs. weak interests has been another criticism of the test.
At the same time, the Kuder Record is an excellent measure of
the range of expressed interests, and as such is valuable in
educational and vocational guidance. It is often possible to point
out to a student that he has expressed many interests not in line
with his vocational goals. The median reliability coefficient for
the nine interest-areas is .91.


Strong Vocational Interest Blank* 


Description. This was the first vocational interest blank and is
still the best-known. There are forms for men and women. The
Blank has gone through several revisions and in its latest form
comprises four hundred items grouped under eight categories.
These are occupations (likes and dislikes), school subjects,
amusements, outdoor and indoor activities, responses to
peculiarities of people, choice of activities, comparison of
interests, and evaluation of personal abilities. The examinee
indicates his choices by circling or marking. Answers to the items
are given numerical weights, obtained by comparing the replies of a
defined occupational group (lawyers, for example) with the replies
of people in general. In all, forty-five occupations or areas of
interest are covered by the Blank.

* Published by the Stanford University Press, Stanford, Calif.


Scope. For men and women. 


Scoring. A person’s score on a given scale (his interest in
teaching, for example) is found by totaling the plus (+) and minus
(−) credits obtained from the options he has marked. A separate
key is used for each vocation. Thus an examinee’s blank may be
scored for the interests of an engineer, a physician, and a sales
manager. Point scores are converted into standard scores to
afford a direct comparison. A useful scale of letter grades is
also available: A represents close identification of interests with
the given vocation, B+, B, and B− somewhat lesser agreement,
and C+ and C a very different interest pattern from that of the
occupation under study. For example, a college student may have
A and B scores in the interests of a minister and social worker, and
C+ and C scores in the interests of a mathematician or physicist.
Somewhat less time-consuming, and often more valuable than
specific vocational scores, are the scales for interest clusters.
There are eleven of these clusters, for example, personal-social
(“uplift,” interest in social science, interest in becoming a
teacher, preacher, or school superintendent); mathematics-science
(chemist, engineer); and business-commercial (salesmen, various
business interests, making money). The use of clusters or
occupational families provides greater flexibility in the use of the
Strong Blank. The Vocational Blank is in reality [...] its aid, the
counselor [...] and directions of a [...] which his parents have for
him. The counselor must then seek out discrepancies and try to
resolve them to the satisfaction of all concerned.

SUMMARY ON PERSONALITY INVENTORIES 


Validity. Insofar as an inventory includes questions which
experts agree are relevant to the area being tested, the
questionnaire has content validity. The adjustment inventories (PD
Sheets) are made up of items drawn from texts on abnormal
psychology and cover conditions which have been found to be
symptomatic of mental illness. The interest inventories have been
validated experimentally against a number of criteria: expressed
interests of successful professional and businessmen, successful
completion of training courses, ratings for work success, staying
in an occupation vs. leaving it, and degree of job satisfaction.
Correlational analysis has been used to locate clusters of items
which embrace a common core of interest or a community of
interest patterns. Follow-up studies of the Strong Vocational
Interest Blank show that men tend to stay in occupations for
which they expressed strong interests as students and to change
occupations for which their expressed interests were weak.

In using interest inventories, several precautions should be 
taken. It is well to remember that interest and aptitude are not 
the same thing, and that many youngsters express interest in 
vocations for which they have little capacity. Again, the interests 
of young people—especially those below the age of 25—often 
change markedly. Adolescents may express unrealistic interests 
which change drastically later on. More than one determination 
of interests, therefore, should be made. Finally, it must be 
remembered that advice about occupational families is much 
safer than advice about specific jobs. Any inventory, personality 
or interest, should be supplemented by school and intelligence 
records, ratings for health, appearance, motivation and socio- 
economic status. 


Reliability. The reliability coefficients of most inventories are
high, .80 or more. As interests change over a period of time,
reliability determinations can be relied on for short periods only.


Scaling. Inventories are usually scored by assigning weights to
the various options presented. These points are converted into
percentile norms, standard scores, and sometimes letter grades.
Norms for the adjustment inventories are most often for
students, less often for occupational groups. The interest
inventories report norms for occupational families (Kuder) and for
[...] occupations (Strong). Test Manuals provide many useful
suggestions for the interpretation of the inventories.


SUGGESTIONS FOR FURTHER READING 


Anastasi, A. Psychological Testing. New York: Macmillan, 1954.

Freeman, F. S. Theory and Practice of Psychological Testing (Rev.
edition). New York: Holt, 1955.

Jordan, A. M. Measurement in Education. New York: McGraw-Hill,
1953.

Travers, R. M. Educational Measurement. New York: Macmillan, 1955.


SUGGESTIONS FOR LABORATORY WORK 


1. One of the best ways to become acquainted with a personality
inventory is to take it yourself. Members of the class should take as
many of the questionnaires as are available, score them, and draw up
profiles where called for.

2. Examine the Manual for the Kuder Preference Record (Vocational).
What is said about validity, reliability, scaling techniques, and norms?

3. Study the Manual for the Allport A-S Reaction Study and/or the
Manual for the Bell Adjustment Inventory. How are these inventories
constructed?

QUESTIONS FOR DISCUSSION 


1. In an adjustment inventory, the number of positive answers
becomes the score. What is meant by saying that Stanley is on the median
for a personal data sheet?

2. Which interest blank, the Kuder or the Strong, is more appropriate
for high-school students?

3. How might an interest inventory be used in studying a child’s
personality trends?

4. How closely related are interests and aptitude? Does the
relationship change with age?

5. Why are personal data blanks of little value when administered as
“group tests”?

6. Under what circumstances do you think the interest inventory
would be most helpful? At what age levels? Give reasons for your
answers.

7. What factors limit the usefulness of paper-and-pencil adjustment
inventories?

8. A high-school senior expresses a strong interest in engineering, but
his interest inventory score does not confirm this interest. What would
you as counselor suggest to him?

9. The Strong Vocational Interest Blank has a key for various specific
occupational interests (dentist, banker, carpenter, for example). What
difficulties do you see in such restricted interest patterns?

10. Why is a personal data sheet easier to fake than an interest blank,
even when the items are not forced-choice?


CHAPTER 8 


OBJECTIVE-TEST ITEMS AND
SHORT-ANSWER TECHNIQUES


[...] knows the principles which [...] items and the assembling [...]


This chapter will describe some of the better-known (and
more widely used) verbal objective type items. These include
the true-false, multiple-choice (best-answer), matching,
completion, and short-answer essay questions. The advantages and
disadvantages of each of these item types are listed and examples
given to illustrate errors to be avoided in writing items of each
type. Objective tests employ numbers, geometric forms, pictures,
and diagrams, as well as words. (Figure 8-1 provides a number of
illustrations.) Some of the varieties of non-verbal items frequently
encountered in standard tests fall under the following heads:


1. Number Series Completion.* The examinee is asked to
complete a series of numbers (which are related in some way)
by the addition of one or more appropriate numbers.

2. Figure Completion. The examinee must complete a figure by 
the addition of a line or other detail. 

3. Likenesses and Differences. From a list of pictures showing 
objects or activities, the examinee is required to select several 
which belong together, or to select an item which does not 
belong with the others. 

4. Picture Completion. The examinee is to complete a picture 
from which one or more items have been omitted. 

5. Errors in a Drawing. The examinee must locate and correct 
errors in a drawing. 

6. Arranging Pictures. The examinee is to arrange a set of pic- 
tures in orderly fashion so as to tell a story. 


Most non-verbal items are variations on the multiple-choice 
type. Non-verbal items are frequently used in tests designed for 
young children. (See page 119.) 

Comparison of Objective Items and Essay Questions 

The traditional essay question often covers too much ground,
and is open to large errors in scoring and interpretation.
Consider the question “Discuss the causes of the War of 1812” as
an example of a common form of essay question. Answers to this
question will almost certainly include material that is true and
relevant, material that is ambiguous, material that is clearly
erroneous, and material that is mostly padding. It becomes
well-nigh impossible for two or more readers to evaluate the answers
to such a question in the same way. However, when choices in
objective test questions are recorded by checking one of several
possible answers, circling a number, or underlining a word or
phrase, the grade on the test will be the same whether the scoring
is done by a clerk or by an expert. And the answer will be right
or wrong.


* This test is also classified as verbal.


FIGURE 8-1 Objective-Test Items Represented by Picture, Drawing, or Diagram

[Figure not reproduced. Its panels show true-false items on pictures
about which the examiner reads statements aloud, multiple-choice
items on pictured tools, a geometry item answered TRUE, FALSE, or
CANNOT TELL, an item asking which figures can be made from a given
pattern, and an item asking which picture is most like three others.]

Examinations composed of objective items possess several other
advantages over questions of the essay type. The objective item
not only eliminates unreliability due to personal opinion but is
the more easily scored, is economical of time, and allows for a
wider sampling of material. Furthermore, the objective test item
forces the student to answer a question directly, gives him little
opportunity to equivocate or dodge, and is, for that reason, a
more dependable measure of what a student knows. On the
negative side, the objective item may provide little opportunity for
the examinee to display his understanding and organizing ability.
When poorly made, the objective item may lay too much stress
on rote memory and unrelated bits of information.


Defining the Purpose of the Test Item 


It is necessary to keep constantly in mind the purpose we 
intend our test items to serve. Items may then be selected—or, 
in a standard test, examined—with these objectives in view. We 
cannot always be sure, it is true, of exactly what a given item 
is measuring. But we can sharpen our aim by setting up definite 
specifications (page 211) which we want our items to meet. For 
example, an item should: 


1. Elicit information (often fairly specific) which reveals an
understanding of a process, principle, situation, or historical
movement.

2. Require the examinee to demonstrate knowledge and use of
technical terms and concepts.

3. Give the examinee a chance to show his ability to apply a
principle in the solution of a problem, draw a conclusion,
arrive at a generalization.

4. Call forth responses which will reveal the examinee’s
attitudes, interests and personality traits.


Not every item, of course, can be fitted neatly into one of
these categories. Some (many, we hope) will cut across several.
Nevertheless, each item should be written to achieve a definite
purpose, to call out some important bit of knowledge,
understanding or application.


Assembling Test Items 


In the process of making an objective test, the type of item 
to be used must be decided upon and the items written, before 
we are ready to assemble them into tentative form and try out 
the test. Several problems arise: determining the difficulty of 
the items and their discriminative power, drawing up directions, 
and preparing a key and scoring sheets. Methods for carrying
out these procedures will be treated in Chapter 9. 


TRUE-FALSE ITEMS 


The true-false test presents a series of statements or questions
each of which is to be marked “T” (true) or “F” (false). Instead
of circling one of the letters “T” or “F,” the examinee may be
asked to circle “Yes” or “No,” or to write + (plus) or − (minus),
or in some other way to designate a positive or negative answer.
One of the earliest objective forms, the T-F test is still widely
used in group intelligence as well as in educational achievement
and aptitude tests. It has been criticized as being a measure of
rote memory, a test of detached and unrelated facts, and as often
being ambiguous and equivocal. Such strictures are justified when
the test is poorly or carelessly made. There is a large element of
guessing in T-F tests, too, and good items are not easy to
construct, however simple the process may seem to be. But when
well made, T-F items have valuable possibilities arising from
their scope and flexibility. The chief advantages and
disadvantages of the T-F item may be summarized as follows:


Advantages:

1. It may be used with a wide variety of materials.

2. It may be scored easily and objectively.

3. It is the easiest objective type to construct.

4. It makes possible an extensive sampling of material in a
relatively short space.

5. It is a time-saver, thus allowing for frequent testing.

6. The directions are readily understood and followed.


Disadvantages:

1. It is often ambiguous and confusing.

2. It is open to guessing and to chance effects.

3. Much subject matter cannot be stated as unequivocally true
or false.

4. It may readily become a test of detached and unrelated bits of
information.

5. It may overstress rote memory at the expense of
understanding.


Some of the rules useful in constructing teacher-made tests are
given below. In judging the adequacy of printed T-F items, it
will help to note whether these rules have been observed.


1. Putting the symbols “T” and “F” before each question is
preferable to having the examinee write the letters at the end of
a statement, thus scattering his answers over the page. Circling
or marking saves time in scoring.


    On the test paper:        On the answer sheet:
    Ⓣ F  1.                   1. Ⓣ F
    T Ⓕ  2.                   2. T Ⓕ


2. Make the number of true statements equal to the number
of false statements. The scoring formula for T-F items is


        Score = Right − Wrong
    or  Score = Total − 2 × Wrong


Either of these formulas corrects for guessing, and both give
the same result provided the pupil has tried all of the items.
Suppose for example that there are sixty items in the test, and a
pupil gets forty right and twenty wrong. Then his score is 40 − 20
or 20, or 60 − 2 × 20 or 20. If the child does not try all of the
items, the two versions of the formula will not give the same
result and the first (R − W) should be used.

If an examinee guesses at every item, he should have one-half
of the items right and one-half wrong, and his score (R − W) is
properly zero. If an examinee attempts only thirty out of forty
items in a given examination, his score may be corrected to a
total of 40 by adding one-half of the untried items, that is, half
of 10, to his number right. (Presumably he would get one-half
of the untried items right by guessing.) It is not necessary to
correct every paper to the number of items in the test. But test
scores for a class are the more fairly compared when all are
based upon the total number of items in the test.

The correlation between number right and (R − W) is
perfect when all of the items of the test have been tried. Hence,
when a child’s score has been corrected to the total, number
right may be taken as the score instead of (R − W). The
question of whether to tell an examinee to guess has excited much
controversy, partly because of the opprobrium attached to the
term guessing as related to school examinations. A good general
rule is to instruct the student to omit only those items which he
is sure he doesn’t know, to try an item even when not entirely
certain of the answer, but never to guess wildly. Since the
examinee has been exposed, at least, to the subject matter of the test,
the chances are better than even that his answer will be based
on some information, even if it is vague and uncertain. Hence,
a T-F answer is more likely to be right than wrong.
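
The arithmetic of the true-false scoring rule can be checked with a short sketch in Python. It is offered only as an illustration of the two formulas and of correcting a paper to the total; the function names are invented here, and the figure of 24 items right in the last example is a made-up value, not one from the text.

```python
def score_right_minus_wrong(right, wrong):
    """Corrected T-F score: Score = Right - Wrong."""
    return right - wrong


def score_total_minus_twice_wrong(total, wrong):
    """Same score when every item was tried: Score = Total - 2 x Wrong."""
    return total - 2 * wrong


def right_corrected_to_total(right, attempted, total):
    """Add one-half of the untried items to the number right."""
    untried = total - attempted
    return right + untried / 2


# Sixty items, all tried: forty right and twenty wrong.
print(score_right_minus_wrong(40, 20))        # 20
print(score_total_minus_twice_wrong(60, 20))  # 20

# Guessing at all sixty items: thirty right and thirty wrong on the average.
print(score_right_minus_wrong(30, 30))        # 0

# Thirty of forty items tried, 24 of them right (an assumed value).
print(right_corrected_to_total(24, 30, 40))   # 29.0
```

When some items are left untried, the two score functions no longer agree, which is why the first form (R − W) is the one to use in that case.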


3. Avoid opinionated and trivial (or trick) items. 


Examples: T F Character is more important than intelligence.
          T F The ABC Test of Mental Maturity contains 75
              items arranged into 6 sub-tests.
          T F William Collins Bryant is the author of
              Thanatopsis.
          T F One-half of a perfect correlation is .50.


The first of these items calls for a value judgment, which may
be true or false; the second and third ask for trivial information;
and the fourth is a trick question which happens to be false.

4. Avoid ambiguous statements, those partly true and partly
false, and those containing negatives, especially double negatives.


Examples: T F Socio-economic factors are often the cause of
              war.
          T F William Jennings Bryan, the great Commoner,
              was twice elected president of the U.S.
          T F Not every [...]

5. Avoid textbook language and verbatim quotations. Such
items encourage rote memory and are often ambiguous when
taken out of context.


Examples: T F The role of the teacher is to help the pupil
              establish satisfying goals.
          T F Heredity determines what a man can do,
              environment what he does do.


Textbook verbiage aids in making a correct guess.


6. Avoid specific determiners, such as all, none, always, never,
every. Broad generalizations introduced by these words are
usually false.


Examples: T F Feeblemindedness is always present in
              delinquency.
          T F Corporal punishment is never justified.
          T F All ministers lead lazy lives.


These items are all too general and all are incorrect.


The T-F item is not so popular among teachers as it was 
formerly, and it is not found so often in standard tests. It is still 
ranked high, however, and is perhaps the quickest way of sur- 
veying a wide range of material. When supplemented by other 
test forms, T-F is a valuable objective item. 


MULTIPLE-CHOICE OR BEST-ANSWER ITEMS 


The multiple-choice item consists of a statement, question,
phrase, or word followed by several responses only one of which
is correct. Multiple-choice is one of the most flexible of the
objective-recognition-type forms. It is a favorite with teachers
when making their own examinations, and is most widely
employed in the standard printed forms. Multiple-choice items can
be so constructed as to measure information, comprehension,
understanding of principles, and ability to interpret data. The
test form is applicable to most subjects and to most materials.
Some of the strengths and weaknesses of the multiple-choice
item can be summarized as follows:


Advantages

1. Answers are objective and are rapidly scored.

2. Items may be written to measure inference, discrimination,
and judgment.

3. Guessing is minimized when four or five choices are allowed.

4. Items may be constructed to measure recall as well as
recognition.


Disadvantages

1. Items are often too factual, stressing memory unduly.

2. More than one response may be correct or very nearly correct.

3. It is difficult to exclude clues.

4. Distractors (that is, incorrect but plausible answers) are
often hard to find.


Rules for constructing multiple-choice items for teacher-made
tests and for judging the adequacy of such items in printed tests
are as follows:


1. Vary the position of the correct response: put the right
answer in the first, second, third, fourth positions equally often.
A scoring formula for multiple-choice items is


    Score = Right − Wrong / (n − 1)


in which n = the number of choices [...] formula is used to
correct [...] of the question, he is more [...] answer. In most
educational [...] number right as score saves time [...] purposes.
It must be remembered [...] of options must be the same [...]
formula is to be used.
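
A brief sketch in Python shows how the multiple-choice formula behaves. It assumes, as the formula itself requires, that every item on the test has the same number of choices; the function name and the sample numbers of right and wrong answers are invented for illustration.

```python
def corrected_score(right, wrong, n_choices):
    """Score = Right - Wrong / (n - 1), with n the number of choices per item."""
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    return right - wrong / (n_choices - 1)


# With two choices the formula reduces to Right - Wrong, the T-F rule.
print(corrected_score(40, 20, 2))  # 20.0

# A made-up five-choice paper: thirty right and twelve wrong.
print(corrected_score(30, 12, 5))  # 27.0

# Marking fifty five-choice items at random gives about ten right
# and forty wrong, and the corrected score is zero.
print(corrected_score(10, 40, 5))  # 0.0
```

The more choices an item offers, the smaller the deduction for each wrong answer, since a blind guess is less likely to succeed.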


2. Do not include responses which are so unlikely or
implausible or so unrelated to the question as to give the answer away.
Distracting responses should distract, not confuse.


Examples: The function of a flower is to
          ____ give pleasure to mankind
          ____ attract insects
          ____ illustrate the modification of leaves
          ____ produce seed

          The capital of the United States is
          ____ Washington
          ____ Rome
          ____ Tokyo
          ____ London
          ____ Honolulu

          The principal crop of Iowa is
          ____ pineapples
          ____ corn
          ____ oranges
          ____ bananas


In the first example, assuming the fourth choice to be the
correct one, the distractors are all rather silly. In examples two and
three, an examinee would have to be almost totally ignorant of
geography to be taken in by the distractors.


3. Do not provide wrong answers which are plausible enough
to mislead the good student because they are close to the right
answer. The good student is often led astray by knowing a good
deal (but not quite enough) about a question, whereas the poor
student does not know enough to be misled by a plausible but
wrong answer.


Example: What was one of the important immediate results of
the War of 1812?
____ the introduction of a period of intense sectionalism
____ destruction of the U.S. bank
____ defeat of the Jeffersonian party
____ final collapse of the Federalist party




The fourth response is keyed as the correct one. But 39 per cent
of eight hundred high-school pupils, all of them ... students,
checked the first option as correct. Apparently the first
answer is plausible to students who know a good deal about the
War of 1812.


4. Do not give away the correct answer by providing clues
such as (a) familiar textbook phrases, (b) having the right
option consistently longer or shorter than the wrong options,
(c) repeating the words of the question, (d) asking questions
to which the answer must be singular or plural, with only the
correct response being in the right number.


Examples: In what major labor group have unions been organized 
on an industrial basis? (Circle one letter.) 
A. Congress of Industrial Organizations 
B. Railway Brotherhoods 
C. American Federation of Labor 
D. Knights of Labor 
E. Workers of the World 


The meaning of the German word Gestalten is (Check one)
____ a response
____ just-noticeable-difference
____ stimulus
____ configurations
____ a perception

A man hears a loud noise and runs to the window.
This is an example of
____ motivation
____ memory image
____ stimulus-response
____ posthypnotic suggestion
____ purposive behavior

In the first of these examples, the adjective “industrial” in the
question gives the answer away. In the second, if the student
knows that Gestalten is the plural of the German word Gestalt,
he has the answer as “configurations.” In the third, the textbook
phrase “stimulus-response” is a clear clue.


5. In a multiple-choice vocabulary test, none of the response 
choices should be as difficult as the test word. The difficulty of 
response words can be determined from their frequency in 
Thorndike’s Teachers Word Book. Response words should be
of the same part of speech as the test word, and only one should 
be correct. 


Good example: An irksome task is (a) pleasant, (b) engrossing, 
(c) instructive, (d) wearisome 

Poor example: Do not despise him means do not (a) hate,
(b) malign, (c) deprecate, (d) desiccate him

In the second example, some of the response words are more
difficult than the test word. This is not true of the first example.


6. Direct questions or statements followed by a series of 
options are usually clearer than questions in which the answers 
are imbedded in the statement. In the latter form, the examinee 
must read through the statement for each option. 


Good example: A 10-year-old receives a percentile rank of 40 
on a test of arithmetic. This means that 

he is above the mean of 10-year-olds on 
the test. 
he exceeds 60 per cent of 10-year-olds. 
40 per cent of 10-year-olds did worse 
than he. 
61 per cent of 10-year-olds exceeded his 
score. 

Poor example: Percentile rank shows the per cent (a) at or 
above, (b) above, (c) at, (d) below, (e) at or 
below the given score. 

The second example is more difficult to decipher than the first. 
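
The scoring formula given under rule 1 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of any published scoring key; the function name `corrected_score` and the sample counts are assumptions made for the example. As the rule notes, the formula applies only when every item has the same number of options.

```python
def corrected_score(right: int, wrong: int, n_choices: int) -> float:
    """Apply Score = Right - Wrong / (n - 1).

    Omitted items are simply left out of both counts.
    """
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    return right - wrong / (n_choices - 1)


# Hypothetical pupil on a 40-item, four-choice test:
# 28 right, 9 wrong, 3 omitted.
print(corrected_score(28, 9, 4))   # 28 - 9/3 = 25.0

# With two choices (true-false) the formula reduces to Right - Wrong.
print(corrected_score(30, 10, 2))  # 20.0
```

Because wrong answers are divided by one less than the number of choices, the penalty per error shrinks as more options are offered.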


A test made up of multiple-choice items takes more time to
construct than a test of T-F items. Furthermore, good multiple-choice
items are harder to prepare than T-F items, as it is
often difficult to find acceptable distractors. The advantage of
the T-F item is largely offset, however, by the fact that multiple-choice
items are more searching and demand a more thorough
knowledge of the subject-matter. Multiple choice is regarded
by most test experts as the best of the short-answer forms.
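
Rule 1 above asks that the right answer fall in each position equally often. One way a teacher might arrange this is sketched below in Python. The helper names `balanced_key_positions` and `place_answer` are hypothetical, and the approach (cycle through the positions, then shuffle) is just one reasonable choice.

```python
import random
from collections import Counter


def balanced_key_positions(n_items, n_choices, seed=None):
    """Return one answer position per item, each position used about equally often."""
    rng = random.Random(seed)
    # Cycle 0, 1, ..., n_choices - 1 across the items, then shuffle the order.
    positions = [i % n_choices for i in range(n_items)]
    rng.shuffle(positions)
    return positions


def place_answer(correct, distractors, position):
    """Insert the correct option among the distractors at a 0-based position."""
    options = list(distractors)
    options.insert(position, correct)
    return options


positions = balanced_key_positions(n_items=20, n_choices=4, seed=1)
print(Counter(positions))  # each of the four positions appears five times
print(place_answer("corn", ["pineapples", "oranges", "bananas"], 1))
```

When the number of items is not a multiple of the number of choices, some positions are used once more than others, which is as close to "equally often" as the counts allow.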


multiple-response items as well! 
Two examples follow: 


Example: Under each of the following psychological doctrines,
viewpoints, or systems, indicate by a cross (x) those
implications or consequences which are characteristic
of that doctrine.

1. English Associationism
____ a persisting self
____ summation and integration of mental states
____ universal categories of reason
____ mental faculties
____ persisting motor-response systems

2. Purposive Psychology (McDougall)
____ imageless thought
____ introspection as the primary method
____ S-R units
____ motivation in terms of instincts
____ doctrine of the unconscious
____ conative tendencies

Each of these items can be described by more than one choice.


MATCHING ITEMS 


In the matching test, one list of words, names, phrases, formulas,
or statements is to be matched against another list. The
test may consist of (a) a list of names in one column to be
matched against a list of achievements in a second column;
(b) a list of terms to be matched against a list of definitions;
(c) labels to be matched against charts and diagrams; (d) authors
to be matched against books, dates and events.

The matching item possesses the advantages of interest and 
variety as well as ease of scoring. It is, furthermore, somewhat 
easier to construct than the multiple-choice item. Matching has 
been frequently used to test the relationship between dates, events 
and various facts. On the negative side, the matching item often
measures recognition memory rather than understanding, and is
especially open to clues. Nor do matching items ordinarily test 
ability to organize facts or to apply principles. 

Rules for making up matching test items may be set down as 
follows: 


1. Do not include too many items in the lists: 10 or 12 is the 
maximum, 5 or 7 often enough. When lists are long, examinees 
must spend too much time hunting through them. Have the 
number of items in the column from which selections are to be 
made larger than the number in the list to be matched. This 
lessens the chances that an examinee will match an item correctly 
by a process of elimination.

Example: The following statements are representative of different
schools of psychology. In the blank spaces before
the statements, write the number of the psychologist
for whom the statement is typical.

(1) Adler        (7) McDougall
(2) Angell       (8) James Mill
(3) Calkins      (9) Pavlov
(4) Freud        (10) Titchener
(5) Jung         (11) Watson
(6) Koehler      (12) Woodworth

____ Sensory processes have the attribute of clearness,
     just as they have quality and intensity.
____ There is evidence for the existence of three
     types of native and unlearned emotional reactions:
     fear, rage and love.
____ The inadequacy, the relative futility, of all
     attempts to ignore the purposive, the goal-seeking
     nature of behavior renders behaviorism untenable.
____ Any mechanism, except perhaps some of the
     most rudimentary that give the simple reflexes,
     once it is aroused, is capable of furnishing its
     own drive and also lending drive to other connecting
     mechanisms.
____ The will-to-power is the great motive in mental
     conflict.
____ The superego represents the repressions of instinct
     and dominates the ego.
____ Mind is primarily engaged in mediating between
     environment and the needs of the organism.
____ Sensations are one of the primary states of
     consciousness; ideas are the other.


2. ... one subject-field only, so that ... several plausible
matches in column 2. Explain clearly the basis of the matching.

Example: In column 1 are words which illustrate a number of
parts of speech; in column 2 is a list of various parts
of speech. Determine what part of speech a word is and
then identify it by putting its number before the
proper item in column 2. For example, “boy” is a
noun, and if “boy” were numbered 5, a 5 would be
placed before the word “noun” in column 2. Arrange
the choices in alphabetical order.

(1) and          ____ adjective
(2) eat          ____ adverb
(3) rapidly      ____ noun
(4) jump         ____ preposition
(5) from         ____ verb
(6) rich
(7) either


Example: Match the items in column 1 with the appropriate
items in column 2.

A. Harvey        ____ contagious disease
B. stomach       ____ digests food
C. poison        ____ discovered circulation of the blood
D. Galen         ____ early Greek physician
E. lungs         ____ supplies oxygen to the blood
F. heart
G. measles


The first example is quite easy. But it should enable a teacher to
spot grammatical confusions. All of the material is from the field
of grammar. The second item is poor owing to heterogeneity in
the list of choices (names and bodily organs).


3. Arrange names in alphabetical order, dates and numbers in 
sequence in order to save the examinee’s time. 


Example: Select the inventor from the first list and put his number
opposite his invention in the second list.

(1) Colt         ____ Atlantic cable
(2) Edison       ____ cotton gin
(3) Field        ____ electric starter
(4) Franklin     ____ sewing machine
(5) Howe         ____ steam engine
(6) Kettering    ____ wireless telegraphy
(7) Marconi
(8) Watt
(9) Whitney


4. Avoid clues, for instance, one singular item in both lists,
the others plural; one item in the list of a different part of speech
from the others. Watch for irrelevant (but revealing) associations,
such as nationality, which give away the matching; for
example, if the examinee knows that a certain discovery was
made by a Frenchman, he will look for a French name.

The matching item is compact and usually interesting to students.
It enables a teacher to cover a wide territory in a fairly
short time. Matching is well suited to rapid surveys of
aspects of a field when persons, events, or definitions are wanted,
or when these constitute necessary knowledge for further work
in the subject.
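
Several of the rules for matching items are mechanical enough to check automatically. The Python sketch below is one hypothetical way to do so: the function name `check_matching_item` and the wording of its warnings are assumptions, and it covers only list length (rule 1), the surplus of choices (rule 1), and alphabetical order of the choices (rule 3). Homogeneity and hidden clues still call for the teacher's judgment.

```python
def check_matching_item(to_match, choices, max_len=12):
    """Return a list of warnings for a draft matching item (empty if none)."""
    problems = []
    if len(to_match) > max_len or len(choices) > max_len:
        problems.append(f"a list has more than {max_len} entries")
    if len(choices) <= len(to_match):
        problems.append("choice list should be longer than the list to be matched")
    if list(choices) != sorted(choices, key=str.lower):
        problems.append("choices are not in alphabetical order")
    return problems


inventions = ["Atlantic cable", "cotton gin", "electric starter",
              "sewing machine", "steam engine", "wireless telegraphy"]
inventors = ["Colt", "Edison", "Field", "Franklin", "Howe",
             "Kettering", "Marconi", "Watt", "Whitney"]

print(check_matching_item(inventions, inventors))  # [] (no warnings)
```

The inventor example from rule 3 passes all three checks: nine names are offered for six inventions, and the names are already alphabetical.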


COMPLETION ITEMS 
In this test form 
words or phrases h: 


to perceive over-. 
for guessing. The chief disadvantage 
scoring, which is not entirel jecti 
ing, and in the fact th 


amination-making, although it is not 
ultiple-choice and T-F items. 


Rules for constructing completion items and errors to be avoided
in such items are as follows:


1. Do not copy sentences and paragraphs directly from the
textbook, since this puts too much emphasis on rote memory and
parrot-like learning. Rephrase the language of the text, if that is
used.

Example: Human behavior, more than that of any other animal,
is a product of ..........

Example: Much learning is by trial ..........

The first is a poor item. It is out of the textbook and will be
known by those who recall the textbook language. The second
is also a poor item; the pat expression “trial-and-error” gives it
away.

2. Too many blanks make it impossible for the examinee to 
get the meaning, especially if the sentence is short. 


Example: Civilized man ..........; uncivilized man ..........
This item actually appeared on a printed test. It is impossible to 
complete it, or else it can be completed in a wide variety of ways, 
most of them not indicative of much knowledge. 

3. Scoring is more objective if words rather than phrases are 
deleted. Blank out key words—those which carry the meaning of 
the sentence or paragraph—not unnecessary elements or the 
articles a, an, the. 


Examples: Democracy is that form of .......... in
which all of the .......... exercise the


Democracy is that .... 
ernment in which 


people ...- 


3 governing power, 
representatives elected rere eee ee 


themselves. 


The first form of the item is the better, since the blanks contain
key words. The second version deletes connecting words which
do not carry the meaning of the sentence.




.......... established the first laboratory
for the .......... study of psychology in
..........

This is a satisfactory item, if we want to know who established
the first laboratory in experimental psychology and when it was
founded.


4. Make the blanks long enough to permit legible answers.
Have all blanks of standard length to avoid clues as to the length
of the completing word.


5. A good plan is to allow one point for each
correct answer, none for an incorrect one.

6. Guard against clues by taking care that completions do
not depend upon (a) grammatical form, (b) pat or textbook
expressions.

Examples: Johnny wears his space suit, even when he ..........
to bed.

A much discussed question is the relative importance
of heredity and ..........

In the first item, the first singular verb is a clue to the number of
the second verb. The second item tests rote memory, and the
“pat” expression “heredity and environment” gives it away.
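
Rules 3 and 4 above (blank out key words, and keep every blank the same length) lend themselves to a small helper. The Python sketch below is illustrative only: `make_completion_item` is a hypothetical name, the ten-dot blank is an arbitrary choice, and the sample sentence is a shortened stand-in for the democracy example rather than the book's exact wording.

```python
import re

BLANK = "." * 10  # every blank has the same length, so length gives no clue


def make_completion_item(sentence, key_words):
    """Replace the first occurrence of each key word with a standard blank."""
    item = sentence
    for word in key_words:
        pattern = r"\b" + re.escape(word) + r"\b"
        item, n_replaced = re.subn(pattern, BLANK, item, count=1)
        if n_replaced == 0:
            raise ValueError(f"key word not found: {word!r}")
    return item


sentence = ("Democracy is that form of government in which all of "
            "the people exercise the governing power.")
print(make_completion_item(sentence, ["government", "people"]))
```

Matching on whole words keeps "government" from also blanking part of "governing", so only the intended key words disappear.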


THE ESSAY QUESTION 
“The essay question has 
years. It is widely used in t 


the essay question is important. Questions beginning with
“who,” “what,” “when,” and “where” are
usually to be avoided when they ask simply for a name (for
example, Napoleon), a date (1492), an event (Battle of Hastings) 
or a location (New York City). But such questions are valuable 
when the information asked for is relevant to the solution of a 
problem, the making of an inference, or the interpretation of
some event. Questions beginning with “why,” “how,” “with 
what consequences,” or “with what significance” are to be pre- 
ferred to simple fact questions. Questions beginning with such 
words as “discuss,” “evaluate,” “outline” and “explain” invite— 
and usually get—a mass of detail, some not relevant. Such ques- 
tions are useful, of course, when we wish to know how well 
an advanced student can select, reject and organize. But they 
are hard to score and are virtually useless in a broad survey or 
for the diagnosis of specific blind spots.


Restricting the Essay Question 

The essay question becomes objective when cast into short- 
answer form and restricted in coverage. Two methods of con- 
trolling the essay question and rendering it more specific may be 
mentioned. 


Recall Questions. Recall items are essay questions reduced to 
the simplest terms. Usually a question is followed by a blank 
space varying in length. Answers are restricted to short para- 
graphs, the account of some event, an algebraic equation and its 
application and the like. Recall items resemble the completion 
type, but they provide for fairly free answers and are less 
restricted. 


(3) List three conditions which must hold true if an
intelligence test is to yield a constant IQ.

The first item calls for a one-line answer. Items (2) and (3) ask
for specific but basic knowledge. Compare (3) with the essay
question, “Discuss the construction of the Stanford-Binet.”

Problem Situations. A problem is stated, and 2, 3, or 4
questions are asked, each focused on some important aspect of
the situation.

Example: A skillful teacher has been characterized as one who
(a) maintains a permissive atmosphere.
(b) avoids negative discipline.
(c) conforms to the wishes of parents.
(d) does not use repetitive drill.
Write one paragraph defending or attacking each of
these propositions (that is, four paragraphs in all).

Example: A recurring problem in child development is that of
maturation. Cite the evidence bearing on the problem
from the following points of view:
(a) neurological
(b) co-twin control
(c) parallel groups


Scoring the Essay Question Objectively 

Perhaps the major weakness of the essay examination lies in
the unreliability of its scoring. Scoring can be made more objective
by the use of the following techniques:


e 
arked anonymously, ther 
agreement between different scorers. 


2. There is less opportunity for preferences, attitudes, and
biases to appear when all papers are read for one question at a
time rather than each paper straight through. Obviously, comparisons
can be sharper with this method.


3. Before reading a question, the teacher can list the basic facts 
which the question is intended to bring out. Points may then 
be assigned to these aspects of the answer. For example, if the 
question deals with a chemical process, the answer list may in- 
clude (a) the necessary equations of the process, (b) the chemical 
elements needed, (c) a diagram of the apparatus, and (d) any 
by-products of the chemical reaction. If the question deals with
English literature, the answer may include (a) the author’s chief
contribution, (b) the cultural setting of the time, (c) the influence
of the author’s work. A check list of key points, with credits
assigned to each, is a useful technique. Thus, from one to three
points may be assigned to each part-answer.


4. If the teacher marks the papers for spelling, writing quality,
and grammatical expression, as well as for content and organization,
credits should be allotted to these aspects of the answer.
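
The check list described in technique 3 can be kept as a simple table of key points and maximum credits. The Python sketch below shows one hypothetical way to total a paper against such a list. The point names follow the chemical-process example, but the specific credit values and the function name `score_essay` are assumptions made for illustration.

```python
def score_essay(checklist, credits_awarded):
    """Total the credits awarded against a check list of key points.

    checklist:        key point -> maximum credit for that point
    credits_awarded:  key point -> credit the reader gave (missing = 0)
    Returns (total awarded, maximum possible).
    """
    total = 0
    for point, max_credit in checklist.items():
        awarded = credits_awarded.get(point, 0)
        if not 0 <= awarded <= max_credit:
            raise ValueError(f"credit for {point!r} must be 0..{max_credit}")
        total += awarded
    return total, sum(checklist.values())


# Assumed credits, each between one and three points per part-answer.
checklist = {"equations": 3, "elements needed": 2,
             "diagram of apparatus": 3, "by-products": 1}
paper = {"equations": 3, "elements needed": 2, "diagram of apparatus": 1}

print(score_essay(checklist, paper))  # (6, 9)
```

Credits for spelling or grammatical expression (technique 4) could be added as further entries in the same check list.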
The essay question is a valuable examination form when held
to one or more defined themes, so that it is scorable. Many
teachers are so impressed by the general use of objective-type
items in the standard tests that they are inclined to drop the
essay entirely. This is a mistake. Many courses, especially advanced
courses, in literature and in science employ objective-test
items as a first approach to an examination of the subject. But
the essay question is the best (perhaps the only) way in which
a teacher can determine whether a student can organize his
knowledge and arrange his arguments in logical fashion. Short-answer
forms should be regarded not as substitutes for the essay,
but rather as supplementary to it.

SUGGESTIONS FOR FURTHER READING 


Gerberich, J. R. Specimen Objective Test Items: A Guide to Achieve- 
ment Test Construction. New York: Longmans, Green, 1956. 




Remmers, H. H., Ryden, E. R., Morgan, C. L. Introduction to Educa- 
tional Psychology. New York: Harper, 1954.

Ross, C. C. and Stanley, J. C. Measurement in Today’s Schools. New 
York: Prentice-Hall, 1954. 

Travers, R. M. How to Make Achievement Tests. New York: Odyssey 
Press, 1950. 


Wrightstone, J. W., Justman, J., Robbins, I. Evaluation in Modern 
Education. New York: American Book, 1956. 


Note: Most textbooks on educational psychology and in measurement
and evaluation contain chapters dealing with objective items.


QUESTIONS AND PROBLEMS


1. Write five true-false items in some subject field familiar to you. 
2. Rewri 


literature. 


4. If possible, put the items in number 2 in completion form. 


5. Rewrite the following essay questions to make them more objective
in answering and in scoring.

a. Discuss some of the proposals for aiding the gifted child. (Hint:
break down into specific proposals, such as special classes, accelerated
promotions, extra assignments, and the like.)

... learning theories. (Hint: This topic ...
labels (behaviorism, for example) or ...
known theorists.)

... industrial revolution.


6. Point out any errors in the following items:

1. The Frenchman who developed the first successful intelligence
... (2) Terman (3) Binet (4) Wundt

... one who is (1) strong (2) handsome
(3) angry (4) pusillanimous (5) capable

T F Edgar Anderson Poe wrote the poem “The Raven.”


sis on the three R’s is not a serious defect 1? 
tional practice. 


ion of the Golden Rule will make for better 




ibution of scores is the midpoint, 
tkedly by very high or very Ney 
scores. 




(5) relax in an easy chair. 
11. The expansion of the binomial (a + b)² is ..........
12. I borrowed a book (1. off, 2. off of, 3. from) my roommate. (Answer) ..........
13. We get the most calories per pound from 
(1) candy (4) potatoes 
(2) carbohydrates (5) proteins 
(3) vitamins 
14. When there is a fire drill, the teacher must make sure that her 
cing OE ERT NE aetna ODSEFVE eer eraa nena eee ences eee 
BO! & eenssirane ciate reed 40d. a 
15. T F The work of Freud has done much to demonstrate that
associative connections ...
though under conditions of everyday life they cannot usually
be recalled.


CHAPTER 9 


CONSTRUCTING THE OBJECTIVE TEST 


eds to know how objective mental 
uch the same reason that he needs 


says about his test when 
€ test items were selected and put to- 
rtant, perhaps, the teacher who knows 
ill be able to improve greatly the 
quality of the day-to-day tests which he makes for his own use.




most schools. Standard printed tests in wide use today are made 
by testing bureaus. These agencies have a staff of experts in item 
writing and construction techniques, technically trained assist-
ants, and access to large and representative samples and to labora- 
tory and scoring equipment. The classroom teacher can hardly 
hope to match all this. And fortunately it isn’t necessary, since 
his test-making is properly on a much more modest scale. 

This chapter will outline the basic techniques in test con- 
struction. These methods apply whether the test is designed to 
measure intelligence, educational achievement, or aptitude. 


WRITING SPECIFICATIONS FOR THE TEST 


Before he begins to construct an examination, the teacher must 
decide what he wants his test to do. This means that he must lay 
down specifications for the test (page 105). Usually a teacher 
wants to test his students’ knowledge of the fundamentals of the 
subject and to see how well they can use this knowledge in 
solving problems. Three subject matter tests in different areas 
will be described in order to show what specifications the 
author had in mind and how he went about accomplishing his
objectives.


Columbia Research Bureau Spanish Test* 


This test is designed for high schools and colleges. Part I calls 
for basic knowledge of the language, and Parts II and III require 
understanding of language structure and application of rules of 
grammar. In more detail, Part I is a vocabulary test of one hundred
words in multiple-choice form. The student is instructed
to mark that one of four or five English words which best defines
the given Spanish word. Part II is a language comprehension
test. There are seventy-five sentences in Spanish arranged in
order of difficulty; each is to be read and marked “True” or
“False.” Part III is concerned with grammar and syntax. This
test consists of one hundred English sentences, each followed
by an incomplete translation in Spanish, which the examinee is
told to complete.


* Published by the World Book Company, Yonkers, N. Y.


California Arithmetic Test (Upper Primary, 
Grades 3 and 4)* 


This test is part of a comprehensive educational achievement 
battery, but it may be given as a separate examination. Its objec- 
tive is to test for skills in fundamental operations, the identifica- 
tion of consistently made errors, and the ability to apply what is 
known to the solution of problems. The eight sub-tests cover the 
four fundamental processes (addition, subtraction, multiplica- 
tion, and division), facility and skill in following directions in- 
volving numbers, and simple “mental arithmetic” problems. 


The Nelson-Denny Reading Test** 


The authors state the objectives of this test as follows: to predict
success in college, to enable a sectioning of college and
high-school classes on the basis of reading skills, to aid in the
diagnosis of scholastic difficulties. The examination consists of
two parts, a test of vocabulary and a test of the ability to read
and understand fairly difficult prose. There are one hundred
words in the vocabulary test, each word followed by five choices,
one of which is to be marked as correct. The paragraph-reading
test is made up of nine selections of approximately two hundred
words each. Four questions are asked on each paragraph. There
are five optional answers for each question, one of which is to
be selected by the examinee. It seems clear that the test measures
basic knowledge of language as well as the ability to use this
knowledge intelligently.


SELECTING ITEMS FOR THE TEST 
In the construction of an examination, both the content and 
the form of the question must be considered.


* Published by the California Test Bureau, Los Angeles.
** Published by the Houghton Mifflin Co., Boston.




Deciding On the Type of Item 

The teacher must first decide what type of objective item 
he wishes to use. True-False and multiple choice are favorites 
for measuring basic knowledge, and multiple choice, matching, 
completion and essay recall are all used to assess understanding, 
interpretation and application. It is probably less confusing for 
the younger students if a sub-test or section contains only items 
of one type and does not switch from one kind to another. The 
test-maker should start with a much larger number of items 
than he plans to have in the completed test. All the questions 
should be read by other teachers of the subject and criticized for 
form and for content. Items judged to be trivial, inappropriate, 
ambiguous, or too narrow in scope should be revised or discarded.
The items which survive this preliminary inspection
should still number considerably more than the number of items
planned for final use. An excess of items is necessary, since some
items will always be discarded as a result of the item-analysis to
follow.


Arranging the Items in Order of Difficulty 

The questions are now arranged in a rough order of difficulty, 
from easy to hard. For the first try-out, the difficulty of an 
item as judged by several teachers is sufficient for placement. The 
test, as tentatively drawn up, is now administered to a sample of
students for whom the final test is intended, for example, to
fifth-grade pupils or high-school freshmen. If several teachers 
of the subject co-operate—and thus increase the size of the 
experimental group—the final test will be a better examination 
than it will be if it is administered to a single small class. It is 
always advisable to get as much information as possible on each 
item. Hence, those examinees who take the examination in pre- 
liminary form should be urged to attempt every item, even when 
they are uncertain of the answer. The time allowance for the 
whole test should be generous, so that every student will have
time to try every item. This may make it necessary to have a
second testing period.




Setting the Time Limit 


The length of the time interval set for the test when put into 
final form will depend on the time available for testing—most 
often one period of about fifty minutes. Time allowances must 
always take into consideration the age of the pupils, type of 
item (amount of computation or reading needed in answering it), 
whether the test is primarily for survey purposes or for diag- 
nosis, and whether speed and/or power are deemed important. 
In examinations which are strictly power tests, the time limits 
should be long enough for all but the very slowest examinees to
finish. Sometimes, naturally, an examination has to be cut in 
length in order to have it fit into the available time. 


ITEM ANALYSIS 


The two characteristics of an item which we need to know
about in building a test are (a) difficulty and (b) validity, or
discriminative power. These two determinants of an item’s goodness
are computed from the same tabulation of the test data.
Computation of the difficulty and validity of an item is called item
analysis.


Difficulty and Validity in Item Analysis 
The difficult 


inees in the tryout 




Biserial r in Item Validity* 


The authors of most standard mental tests have used the biserial
r method (or some approximation to it) in determining the
validity of the items in their tests. By means of biserial r, we can
compute the correlation between success and failure on a single
item and size of total score on the test, or on some other measure
of performance taken as the criterion. The size of the correlation
between item and test score shows how well the item is working
together with other items, that is, whether it is a member of a team.
Items unrelated to total score are discarded.

Steps in the determination of item validity by use of biserial r
are as follows:


1. Arrange the test papers in order for total score from highest
to lowest.


2. Count off the highest and lowest 27 per cent** of the papers
(if not exactly, as nearly so as possible). If there are 120 children
in the “standardizing group,” for example, put 32 in the top and
32 in the bottom groups.


3. Count off the number in the high group and the number in
the low group who pass each item, and express these figures as
percentages. Suppose, for example, that Item 18 is passed by 60
per cent of the high group and by 30 per cent of the low group.
Then from tables prepared for the purpose,† we read that the
biserial correlation between this item and the whole test is .31.
For an item passed by 24 per cent of the high group and by only
3 per cent of the low group, the biserial r is .44. In general, any
item with a biserial r of .20 or more can be taken to be valid if
the test is fairly long. In a short test, items of higher validity are
needed. Both hard items and easy items are valid (that is, have
discriminative power) if they separate the high and low groups.
An item passed by 15 per cent of the high group and only 1 per
cent of the low group (a very hard item), for example, has a
biserial r of .47, whereas an item passed by 92 per cent of the
high group and 65 per cent of the low group (an easy item) has
a biserial r of .39. Both are good items, though they differ greatly
in difficulty.


* For the computation of biserial r see references at the end of Chapter 2.

** There are good reasons for choosing 27 per cent. When the distribution of
ability is normal, the sharpest discrimination between extreme groups is obtained
when item analysis is based upon the highest and lowest 27 per cent in each
case. When larger per cents are in the high and low groups, the reliability of
the determination is higher, but the difference between the two groups decreases.
On the other hand, when per cents in the high and low groups are
smaller, reliability falls off but the difference between the two groups increases.

† See, for example, Item Analysis Table by Chung-Teh Fan, published by the
Educational Testing Service, Princeton, N. J., 1952.


4. Determine the difficulty of each item by averaging the percentages
that pass it in the high and low groups. An item passed
by 60 per cent of the high group and by 30 per cent of the
low group, for example, has a difficulty index of .45 (that is,
$\frac{.60 + .30}{2}$), and an item passed by 15 per cent of the high and
1 per cent of the low groups has a difficulty index of .08. This
summary method of obtaining difficulties of items is not as
accurate as is the practice of using the whole group, but it saves
time and is precise enough for most tests.


5. It can be shown mathematically that items with difficulty 
indices of .50 or thereabouts are the best items, in the sense of 
being able to differentiate among the largest number of good and 
poor students. Not many items, of course, will be found with
difficulty indices of exactly .50; the range of difficulties usually 
runs from above .90 to below .10. If the test is to cover a wide 
range of talent (and that is what is wanted in most school examina- 
tions), a good plan to follow in selecting items is as follows: 


Of items passed by 85-100 per cent (very easy)          take about 15 per cent
Of items passed by 50- 85 per cent (fairly easy)        take about 35 per cent
Of items passed by 15- 50 per cent (fairly hard)        take about 35 per cent
Of items passed by  0- 15 per cent (hard to very hard)  take about 15 per cent


All of the items should, of course, have satisfactory discriminative
power. Note that the different proportions at the difficulty levels
follow the normal distribution.

Items passed by 100 per cent or by nobody have no validity
in either case, but sometimes an author will place several very
easy items at the beginning of the test for psychological effect,
and a few very hard items at the end to test the very bright pupils.


6. In using multiple-choice items, it is important for the
improvement of the examination to know to what extent good and
poor students have chosen the various distractors. If the wrong
answers are illogical, obviously absurd, or otherwise not very
misleading, the examinee will have little difficulty selecting the
right option. The item is easier than it would have been had
the misleads been more attractive. Information concerning the
efficacy of misleads can be obtained by tallying the responses of
the high and low groups to each mislead, as shown below. The
group considered is the 120 children referred to in the illustration
above, and there are 32 (27 per cent) in the high and 32 in
the low groups. The item is of the multiple-choice type with
four options, and the correct answer is keyed as (b).


Item 26        a    (b)    c     d    Omissions   Total
High group     1    16     8     7        0         32
Low group      3     7    10    12        0         32


It is clear that distractor (a) needs to be rewritten, since only
four in sixty-four chose it. Otherwise, item 26 differentiates
between the good and poor students rather well.

A second example shows a slightly different situation. Here (c)
is keyed as the correct answer.


Item 10        a     b    (c)    d    Omissions   Total
High group     0    15    11     5        1         32
Low group      5    10     9     8        0         32


Mislead (b) is chosen by more of the good students than is the
correct answer (c); and this is true, too, of the poor students.
Obviously, mislead (b) must be made less attractive or otherwise
changed so that it doesn’t compete so strongly with (c). Furthermore,
(c) might be strengthened and (a) examined further to
see why it failed to attract any answers in the high group.


FIGURE 9-1 Item Analysis Data for Test File


FRONT OF CARD

Item 36: What marked change took place in the political status of
India in the year 1947?

1. She received a mandate from the United Nations.
2. She won her independence from Britain.
3. Her people were united under Mohammedan rule.
4. She joined the Arab League.


BACK OF CARD

Item 36:       1    2    3    Omissions
High group:   10   32    6    0
Low group:    19   ...  13    2

Sample: 200 high school seniors, tested in June, 1953
Validity: biserial r = .41
Difficulty: 39 per cent


7. Many teachers find it 
future use. A good p 
On the back of th 


contemporary history- 
r has accumulated a large file of items, tests of 




approximately the same range and validity may be made up as 
needed. 
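
Steps 4 and 6 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented here, omissions are left out of the tallies, and the 10 per cent cutoff for a "rarely chosen" mislead is an assumed figure, not one given in the text. The tallies are those shown for Items 26 and 10.

```python
from typing import Dict, List


def difficulty_from_extremes(p_high: float, p_low: float) -> float:
    """Step 4: average the proportions passing in the high and low groups."""
    return (p_high + p_low) / 2


def review_misleads(high: Dict[str, int], low: Dict[str, int], key: str,
                    min_share: float = 0.10) -> List[str]:
    """Step 6: note misleads that are rarely chosen or that outdraw the key.

    `high` and `low` map each option to the number choosing it (omissions
    left out). `min_share` is an assumed cutoff, not a figure from the text.
    """
    total = sum(high.values()) + sum(low.values())
    notes = []
    for option in high:
        if option == key:
            continue
        chosen = high[option] + low[option]
        if chosen / total < min_share:
            notes.append(f"({option}) rarely chosen: {chosen} of {total}")
        if high[option] > high[key]:
            notes.append(f"({option}) outdraws the key in the high group")
    return notes


# Tallies for Items 26 and 10 as given in the text.
item_26 = review_misleads({"a": 1, "b": 16, "c": 8, "d": 7},
                          {"a": 3, "b": 7, "c": 10, "d": 12}, key="b")
item_10 = review_misleads({"a": 0, "b": 15, "c": 11, "d": 5},
                          {"a": 5, "b": 10, "c": 9, "d": 8}, key="c")

print(round(difficulty_from_extremes(0.60, 0.30), 2))
print(item_26)
print(item_10)
```

With these tallies the sketch singles out mislead (a) of Item 26 and misleads (a) and (b) of Item 10, which are the options the discussion above marks for rewriting or further examination. A different cutoff would, of course, flag a different set.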


A SHORT METHOD OF ITEM ANALYSIS 


It is wise for a teacher to understand what the biserial
coefficient of correlation means and what it does, since the device
is utilized in many standard tests and is frequently mentioned in
the literature of testing. At the same time, it is not necessary for
the teacher to employ the method in order to construct good
classroom examinations. The difference between a simple count
of “rights” in selected fractions of the best and poorest pupils
will suffice as a measure of the validity or discriminative power
of an item. First, the items should be gone over by several
teachers, the unsatisfactory items discarded, and the remaining
items arranged in order of difficulty, this determined by the
judgment of the teachers reviewing the items. Next, the test as
tentatively drawn up is administered to a sample of children drawn
from the classes or age levels to be tested. From here on the
steps are as follows:


1. Arrange the test papers in order for size of total score, from 
the highest to the lowest. 


2. Count off the 25 per cent* of the best papers and the 25 per
cent of the poorest papers. If the total group is small (for
example, under fifty) take some larger proportion, say the upper
half and the lower half. Suppose there are eighty pupils in the
experimental sample (try-out group), so that twenty, or 25 per
cent, fall in the high group and twenty in the low group. Each
item may now be examined to see whether it is able to separate
these two criterion groups.


3. Determine the number in each of the two criterion groups
who answer each item correctly. If fifteen in the high group
answer an item correctly, and five in the low group get the item
right, the validity is 15 - 5, or 10, and the validity index is 10/20
or .50.* If all twenty in the high group answer an item
correctly, and none of the low group gets it right, the validity of
the item is maximal: 20 - 0 = 20, and the validity index is 20/20
or 1.00. The lowest validity index of an item by this method is,
of course, 0/20 or .00. Validity indices run, therefore, from 0 to
1. There may be a few items of negative validity (more rights in
the low than in the high group), but such items are rare. Items
having zero or negative validity must be rewritten before they
are used, or discarded if salvage is impossible.

* Unless the biserial r method is used in determining validities, there is no
need to observe the somewhat unwieldy 27 per cent rule; 25 per cent or any
convenient larger percentage will serve.


4. If R_H = number right in the high group and R_L = number
right in the low group, the discriminative power of an item is
simply (R_H - R_L), or (R_H - R_L)/N_H when written as a validity
index. Using the same nomenclature, we may write the difficulty
index of an item as (R_H + R_L)/(N_H + N_L), in which N_H and N_L
are the numbers in the high and low groups, respectively. In our
example above wherein R_H = 15 and R_L = 5, the validity
index is 10/20, or .50, and the difficulty index is (15 +
5)/(20 + 20), or .50. If R_H = 18 and R_L = 12, the validity
index is 6/20, or .30, and the difficulty index is 30/40, or .75.
Again, if R_H = 10 and R_L = 2, the validity index is 8/20, or
.40, and the difficulty index is 12/40, or .30.


5. Select the items having the highest validity indices for the 
final test. Then follow the table on page 216 in apportioning 


items to the various levels of difficulty, if the test is to cover a 
fairly wide range of talent. 


6. It is advisable to examine the misleads when multiple-choice
items are to be used. The method outlined on page 217 will aid
in locating distractors which are too plausible or not plausible
enough. The first kind are too often accepted, and the second
are taken by only a few.


* Validities can be left simply as the difference between the number right
in the two extreme groups. The chief advantage of a validity index is to put
validities in a percentage scale, as are the difficulties.


7. A card file of acceptable items will prove useful when a 
teacher wants to lengthen a test or to replace non-functioning 
items. When there are a number of items, a parallel form of the 
test can be drawn up. 
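
The arithmetic of steps 3 to 5 can be illustrated with a short Python sketch. The function names are assumptions made for this illustration. The cut points .85, .50, and .15 come from the selection plan given earlier, but the plan's ranges share their end points, so the treatment of an item falling exactly on a cut point is an assumption as well.

```python
from typing import Dict


def validity_index(r_high: int, r_low: int, n_high: int) -> float:
    """(rights in high group - rights in low group) / size of one group."""
    return (r_high - r_low) / n_high


def difficulty_index(r_high: int, r_low: int, n_high: int, n_low: int) -> float:
    """Proportion of the two criterion groups together passing the item."""
    return (r_high + r_low) / (n_high + n_low)


# Per cent of the final test to draw from each difficulty band.
PLAN = {"very easy": 15, "fairly easy": 35, "fairly hard": 35, "hard": 15}


def difficulty_band(p: float) -> str:
    """Place a difficulty index in a band (boundary handling is assumed)."""
    if p >= 0.85:
        return "very easy"
    if p >= 0.50:
        return "fairly easy"
    if p >= 0.15:
        return "fairly hard"
    return "hard"


def band_quota(n_items: int) -> Dict[str, int]:
    """Approximate number of items to take from each band."""
    return {band: round(n_items * pct / 100) for band, pct in PLAN.items()}


# The three worked examples from step 4, with twenty pupils in each group.
for r_high, r_low in [(15, 5), (18, 12), (10, 2)]:
    v = validity_index(r_high, r_low, 20)
    d = difficulty_index(r_high, r_low, 20, 20)
    print(r_high, r_low, round(v, 2), round(d, 2), difficulty_band(d))

print(band_quota(40))
```

Because the quotas are rounded, they are only "about" the stated proportions and need not add exactly to the intended length for every test size; the teacher adjusts by a single item where necessary.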

Table 9-1 shows the sort of data which we can expect to get in 
an item analysis of questions administered to a sample of 80, as 
described above. The full table, of course, would contain data 
on all the items and on all the members of the two criterion 
groups. Half of the scores in the high and in the low groups are
not shown in order to shorten the table, but these omitted scores 
are included in the totals upon which the item analysis is based. 
Each of the two criterion groups (the high and the low) consists 
of twenty examinees. 

Examination of Table 9-1 shows that Item 4 is highly valid and 
that Items 1, 2, and 5 are acceptable. Item 3 has no validity and 
must be dropped or changed drastically. An item with a validity 
index of .20 or more may be considered satisfactory—at least 
tentatively. This figure is arbitrary, however. If the test is 
shortened, the acceptable point for a validity index should be 
raised; if the test is lengthened, it should be lowered. Any item 
with an index larger than 0 has some validity and hence some 
value. Note that in Table 9-1 the difficulty indices range from 
.70 (a fairly easy item) to .30 (a fairly hard item). 


SCORING THE COMPLETED TEST 


If the completed test is cast in T-F form, the point scores will
be simply number right, or R - W if we wish to correct for
guessing (page 191). In multiple-choice tests, the correction for
guessing is

\[ \text{Score} = R - \frac{W}{n - 1} \]

where n = the number of choices or options. It is sometimes
advisable to use the correction for guessing with T-F items, but
number right without correction is satisfactory in multiple-choice
when four or five options are provided.
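
As a small illustration, the usual correction for guessing, rights minus wrongs divided by one less than the number of options, can be written as a one-line function. The name `corrected_score` and the sample counts are assumptions made for this sketch.

```python
def corrected_score(rights: int, wrongs: int, n_options: int) -> float:
    """Score = R - W / (n - 1), where n is the number of options per item.

    Omitted items count as neither right nor wrong.
    """
    if n_options < 2:
        raise ValueError("an item needs at least two options")
    return rights - wrongs / (n_options - 1)


# With two options (T-F) the formula reduces to R - W.
print(corrected_score(40, 10, 2))
# With four options each wrong answer costs one third of a point.
print(corrected_score(40, 12, 4))
```

The penalty for a wrong answer shrinks as the number of options grows, which is in keeping with the remark that number right alone is satisfactory when four or five options are provided.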


TABLE 9-1

Item Analysis of the First Five Items of a Test Made Upon Two
Criterion Groups, the Highest and the Lowest 25 Per Cent
in Total Score. N_H = N_L = 20, and N = 80


Highest Group           Total test score          ITEMS
(Best 25 per cent)      in order of size
In order of merit                             1    2    3    4    5
        1                     72              ✓    ✓    ✓    ✓    ✓
        2                     70              0    ✓    0    ✓    0
        3                     68              ✓    ✓    ✓    ✓    ✓
        4                     65              ✓    0    0    ✓    0
        5                     65              ✓    ✓    ✓    ✓    ✓
        6                     65              ✓    ✓    ✓    ✓    ✓
        7                     63              ✓    0    0    ✓    0
        8                     61              ✓    ✓    ✓    ✓    ✓
        9                     60              ✓    ✓    0    ✓    0
       10                     60              ✓    ✓    ✓    ✓    ✓
      . . .
       20                     54              0    ✓    0    ✓    ✓
                            R_H =            15   16   10   20    8

Lowest Group
(Poorest 25 per cent)
        1                     35              ✓    ✓    ✓    ✓    0
        2                     34              0    ✓    ✓    0    0
        3                     30              ✓    ✓    ✓    ✓    0
        4                     30              ✓    0    0    0    0
        5                     27              0    0    ✓    0    0
        6                     25              ✓    0    ✓    ✓    ✓
        7                     25              0    0    0    ✓    ✓
        8                     24              ✓    0    ✓    ✓    0
        9                     23              ✓    ✓    0    0    ✓
       10                     23              0    0    0    ✓    ✓
      . . .
       20                     12              0    ✓    ✓    0    0
                            R_L =             8   10   10    8    4
                      R_H - R_L =             7    6    0   12    4
              (R_H - R_L)/N_H =             .35  .30  .00  .60  .20
      (R_H + R_L)/(N_H + N_L) =             .58  .65  .50  .70  .30


In most cases, it is sufficient for the teacher to express standing 
on the test in point scores or totals. If several classes have been 
tested and it is desirable to compare their performance, percentile 
ranks will be useful. Scaling of teacher-made tests in standard 
scores or normalized scores is not recommended unless the test 
is to be used throughout a school system. 

Directions for the final test should be explicit, and time limits 
should be given. Manuals for standardized tests may be consulted 
with profit for pointers on directions and time limits. A test
should not be so long that most students cannot finish in the time 
allowed. 

The use of scoring stencils will speed up marking when many 
papers are to be examined. In T-F tests, a strip containing the 
answers (a key) may be laid alongside the left-hand margin and 
the answers checked as right or wrong, or simply the right 
answers checked. Separate scoring sheets are useful in dealing 
with multiple-choice and matching items. Spaces are numbered 
on the answer sheet for recording answers to the questions on the 
test. The test blank itself is not marked and may be used more 
than once. 


THE RELIABILITY OF THE COMPLETED TEST 


Perhaps the easiest method of estimating the reliability of a
teacher-made test, since there is rarely more than one form, is
by what is called the “split-half” technique. In this procedure,
the test is administered only once to a sample of examinees, and
is then divided into two half-tests. The first half-test contains
the odd-numbered items (1, 3, 5, and so on) and the second
half-test the even-numbered items (2, 4, 6, and so on).* The
correlation between scores on the two half-tests is now found and
from this r the correlation of the whole test with itself (its
self-correlation) is predicted by the well-known “prophecy
formula.”* To illustrate, suppose that in a class of ten ninth
graders, an English Literature test in multiple-choice form has
an odd-even correlation of .50. What is the probable
self-correlation of the whole test? The prophecy formula is

\[ r\,(\text{whole test}) = \frac{2 \times r\,(\text{half-test})}{1 + r\,(\text{half-test})} \]

Substituting r = .50 for the self-correlation of the half-test, we
have that

\[ r\,(\text{whole test}) = \frac{2 \times .50}{1 + .50} \quad \text{or } .67 \]

This is a satisfactory reliability coefficient (.67) for a single class.
For standardized tests administered to very large groups of a wide
range of talent, reliability coefficients will ordinarily be higher:
.90 or more. For teacher-made tests, however, the reliability
coefficients will rarely be more than .60 to .70. Reliability is higher
over several grades, that is, when the test is given to more than
one grade. The standard error of a test score can be computed by
the formula given on page 29, but for the teacher-made test
this is often a needless refinement.

* Note that when a test is split into odd and even items, the range of
difficulty in the two half-tests is the same and the split is unique. Not just
any split into two half-tests is satisfactory.

Reliability coefficients for a teacher-made test should always
be computed from a new class, never from the sample used in
determining the validities of the test items. Self-correlation in
the standardization group will always be spuriously high, because
the selection of items was based on the scores of the high and
low members of the sample.
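
The split-half procedure can be sketched in Python as follows. The sketch assumes that each pupil's answers are recorded as 1 (right) or 0 (wrong) in item order; the function names and the small set of answer records are hypothetical and serve only to show the mechanics.

```python
from math import sqrt
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half: float) -> float:
    """Prophecy formula: whole-test reliability from the half-test r."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(answers: List[Sequence[int]]) -> float:
    """Correlate odd-item and even-item scores, then step up to full length."""
    odd = [sum(row[0::2]) for row in answers]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in answers]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))


# The worked example in the text: a half-test correlation of .50.
print(round(spearman_brown(0.50), 2))

# Hypothetical answer records for five pupils on a six-item test.
records = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
]
print(round(split_half_reliability(records), 2))
```

A class of five and a test of six items are far too small for a dependable coefficient; the records are kept short only so that the odd and even half-scores can be checked by hand.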


VALIDITY OF THE COMPLETED TEST 


A teacher-made test in physics or French, for example, will
always have content validity, even when the sampling is quite
narrow. Teacher-made tests rarely cover as much material as do
the standard printed tests. An approximate measure of validity
for a test can be found by correlating test scores against school
grades in the same subject. This method is not entirely
satisfactory, since school marks are rarely more dependable measures
of the subject matter than are the tests. When experimental
validity is attempted by correlating scores on a teacher-made test
with grades or with other test scores, a new group must always
be utilized. Such validation, called cross validation, is necessary
because the group used in item analysis is a special group which
has served to select the items in the first place. Cross validation is
necessary also when the two criterion groups (the upper and
lower extreme groups) are selected on the basis of school grades.
A teacher-made test will of necessity correlate with grades
achieved by this group, since the group selected the items.

Perhaps the best way to judge the value of a teacher-made test
is by its predictive validity. If the test aids the teacher in getting
a better notion of the individual differences within the class, and
leads to better understanding of the difficulties of the students
(meager knowledge, wrong knowledge, and so on), it has
fulfilled its purpose.

* The Spearman-Brown Prophecy formula is treated in all standard texts
dealing with statistical method in psychology and education.


SUGGESTIONS FOR FURTHER READING


Bean, K. L. Construction of Educational and Personnel Tests. New 
York: McGraw-Hill, 1953. 

Noll, V. H. Introduction to Educational Measurement. Boston:
Houghton Mifflin, 1957.

Ross, C. C., and Stanley, J. C. Measurement in Today’s Schools (3rd 
edition). New York: Prentice-Hall, 1954. 

Travers, R. M. How to Make Achievement Tests. New York: Odyssey 
Press, 1950. 


SUGGESTIONS FOR LABORATORY WORK 


1. Assume that you have tried out 50 T-F items on a class of 40 pupils. 
Draw up a table like that of Table 9-1 showing how you would carry 
out an item analysis. 

2. If time allows, construct a test using your class as standardizing
sample. Multiple-choice items in arithmetic and vocabulary taken from
E. L. Thorndike’s The Measurement of Intelligence may be used
conveniently. Thorndike’s book gives items by levels over a wide range of
difficulty. Administer a test of about fifty items and item-analyze the
results by the method given on pages 219-223.

3. Take a test which has been given to this class or to some other class. 
Analyze the questions for validity, following the method on page 222. 


QUESTIONS FOR DISCUSSION 


1. A sixth-grade teacher has administered a test in fundamentals of
arithmetic. What analyses of the test data could this teacher make which
would (a) help his future teaching, (b) be of value to individual pupils?

2. Under what conditions would it be profitable to correct scores on
a multiple-choice test for guessing?

3. In some schools, one teacher makes all the examinations in a given
subject. What are the advantages and disadvantages of this procedure?

4. What might an item of negative validity mean? Of zero validity?


CHAPTER 10 


SOME PROBLEMS IN THE EVALUATION 
OF TEST SCORES 


Interpreting Multiple Aptitude Test Scores 

Table 10-1 gives the scores achieved by ten ninth graders on 
the Differential Aptitude Tests (DAT). Scores on any mental 
test are more meaningful when supplemented by the pupils’ 
school grades and by a knowledge of personality traits, interests, 
and ambitions. With this proviso in mind, it will be interesting
to answer the questions below with reference only to the
percentile ranks in Table 10-1.


QUESTIONS ON TABLE 10-1 


1. Which two students show the poorest scholastic ability? In 
what jobs might they do best? 


TABLE 10-1

Differential Aptitude Scores (Ninth Grade). Scores are PR's.

Columns: Students*; Verbal Reasoning; Numerical Ability; Abstract
Reasoning; Space Relations; Mechanical Reasoning; Clerical Speed and
Accuracy; Language Usage: Spelling; Language Usage: Sentences.

* Separate norms for boys and girls.


2. Which student exhibits the most consistently high level of 
ability? 

3. Which students are likely to have reading difficulties? 

4. Which girl should do well in secretarial work? 

5. If Joe Kramer wants to go to college, would you encourage 
him to plan to go into engineering? 

6. Would you encourage Larry Edwards to go into his father’s
accountancy firm after graduation from high school?

7. Jane Goodrich plans to become a medical technician. Would
you recommend this vocational goal?

8. Which students will probably find it hard to graduate from 
high school? 

9. Frank Seay’s father is an auto mechanic and Frank is inter- 
ested in this work. Do you think it a wise vocational choice? 

10. Is it likely that several students are handicapped by poor 
spelling and language usage? Why? 


Case Studies in Evaluation of Abilities 

The three case studies which follow provide considerable data 
about three pupils, two in high school and one in elementary 
school. Questions are planned to focus upon things to look for in 
evaluating the promise of the pupils being considered. 


I. Case Study of Robert T. 

Robert is 16-2, a sophomore in high school. He is well-grown 
and makes a good appearance. He is well behaved, quiet, inter- 
ested, though not as a participant, in sports, and does not read 
much. Robert’s father is a house painter; both parents are high- 
school graduates. Robert wants to go to college, and is encour- 
aged to do so by his parents. He wants to be an engineer. 


School Data

Ninth Grade                  Tenth Grade (First term)
English              C       English              C
Social Studies       B       Social Studies       D
Mathematics          B       Physics              B
General Science      B       French               D
Physical Education   C       Physical Education   C


Test Data

Otis Quick Scoring (Form Gamma)               IQ 112
California Mental Maturity (Language)         IQ 110
California Mental Maturity (Non-language)     IQ 121
Cooperative General Achievement Test:         Percentile Ranks
    I. Social Studies                         38
    II. Natural Science                       52
    III. Mathematics                          36
Kuder Preference Record (Vocational)          Percentile Ranks
    Mechanical                                63
    Computational                             51
    Persuasive                                15
    Artistic                                  12
    Literary                                  46
    Musical                                   51
    Social Service                            20
    Clerical                                  50
    Scientific                                72

1. What do you think of Robert’s chances of succeeding in
college?

2. Robert’s interests are in the mathematics-science area; are
they strong enough for him to plan engineering as a vocation?

3. Robert’s school record is weak in English and Social Studies,
and his interests do not lie in persuasive and artistic fields. What

5. How do you interpret the difference between Robert’s
language and non-language IQ’s?

6. Would you say that Robert’s school grades are not in
keeping with his IQ?

7. Do you think Robert might be more successful as a
technician than as an engineer?

8. Would you recommend that Robert become a salesman?

9. Do Robert’s interests jibe with his achievement test records?
With his school marks?

10. Might Robert do well as an airplane pilot?


II. Case Study of William S. 

William is 18-1, a senior in high school. He makes a good
appearance, is husky and muscular. William is easy-going and
affable; he likes to hunt and is interested in, and good at, sports.
His father is a successful lawyer and his mother is a college
graduate interested in club activities. The parents have planned
for William to study medicine: his grandfather was a
well-known physician in the community. William has accepted these
vocational plans but says he is more interested in business and
sales work.


School Data

Tenth Grade                  Eleventh Grade
English              B       English              C
Social Studies       B       Social Studies       B
Mathematics          D       Mathematics          D
Physics              C       Spanish              B
Physical Education   B       Physical Education   B


Test Data

Terman-McNemar Test of Mental Ability         IQ 118
California Achievement Tests (Advanced)       Percentile Ranks
    Reading                                   65
    Mathematics                               40
    Language                                  60
Differential Aptitude Test (Tenth Grade)      Percentile Ranks
    Verbal Reasoning                          86
    Numerical Ability                         42
    Abstract Reasoning                        38
    Space Relations                           40
    Mechanical Reasoning                      32
    Clerical Speed and Accuracy               55
    Language Usage: Spelling                  75
    Language Usage: Sentences                 93
Kuder Preference Record (Vocational)          Percentile Ranks
    Outdoor                                   96
    Computational                             32
    Persuasive                                90
    Artistic                                  40
    Literary                                  86
    Musical                                   54
    Social Service                            36
    Clerical                                  40
    Scientific                                26


1. Do you think that William is college material?

2. Would you encourage him to plan for medicine as a career?

3. Do William’s grades verify his DAT scores?

4. Is language a strong area for William? Would you, on the
strength of this, suggest some other vocation than medicine for
William? If so, what?

5. What are William’s strong interests, as revealed by the
Kuder Record?

6. Are William’s achievement test scores in line with his
school grades?

7. Do you think William might be happier and more successful
in business? Or in the study of law? Give reasons for your
answers.

8. William’s IQ does not jibe with his DAT scores. Can you
give any reasons why this should be so?

9. The Kuder scores are more helpful than the DAT in
counseling William. Would you agree with this judgment?

10. How would you explain to William’s father his considerable
variability in scores? And how would you explain the
apparent contradictions?


III. Case Study of Mary S. 

Mary is 11-8, in the second half of the sixth grade. She is
pleasant and well mannered, but is judged by her teachers to be
“nervous” and overanxious. Mary wants to be a teacher. Her
father is an auto salesman, with high-school education; her
mother is a housewife, with junior-college training. There are
three other children in the family.


School Data

Fifth Grade              Sixth Grade
Reading          C       Reading          B
Social Studies   C       Social Studies   C
Arithmetic       B       Arithmetic       C
Science          C       Science          C
Language         C       Language         B


Test Data

Kuhlmann-Anderson Intelligence Tests      IQ 110
Metropolitan Achievement Tests            Grade Equivalents
    Reading                               6.1
    Vocabulary                            6.8
    Arithmetic Reasoning                  5.4
    Arithmetic Comprehension              5.2
    English                               6.2
    Spelling                              6.6
    History                               5.6
    Science                               4.6

1. In what subjects is Mary weakest?

2. Would you encourage her to plan for teaching as a career?

3. Is Mary college material? Give reasons for your opinion.

4. Could Mary do office and clerical work successfully?

5. Would it help to have a Stanford-Binet IQ for Mary? Give
reasons for your answer.


Sociometric Testing 


From observations in school and out, most teachers get a fairly
good idea of the social and personal relations within their
classrooms. They soon come to know which children are leaders,
which are well liked and popular, which are disliked or ignored,
and which are picked on and teased. It is sometimes valuable for
a teacher to have, in addition to his own opinion, some measure
of the attitudes and feelings of the pupils regarding each other.
When data of this sort are collected systematically, they may be
put into a table or expressed in the form of a sociogram. This
last is a pictorial or graphic representation of the interpersonal
relations within some specified group, often a class.

The usual procedure is to ask the pupils to designate the
classmate by whom they would rather sit, or the child (or children)
with whom they would prefer to play ball at recess, or to make
some other choice of a companion in a real life situation. Table
10-2 shows the responses made by ten fifth-grade pupils when
asked to nominate their first and second choices of a child to
work with on a class project. (The table reproduces only part
of the data for a class.)


TABLE 10-2
Sociometric Tabulation

(Rows list each CHOOSER and columns each child CHOSEN, in the order
David, Anita, Sally, Gary, Karl, Janet, Jack, Helen, Laura, Ruth. The
bottom rows total the 1st and 2nd choices received by each child.)


A first choice is shown by a 1 under the name of the child
chosen, a second choice by a 2. An asterisk (*) denotes that the
choice for first place was mutual. Thus, David chose Gary and
was chosen by Gary, and Anita and Janet each named the other
as first choice. The summary at the bottom of the table shows
David to be the most chosen child, with three firsts and one
second. Janet is the next most popular, with three firsts. Sally is
chosen by no one, and three of the girls (Helen, Laura and Ruth)
and one boy (Jack) receive no first choices. Tabulation of the
responses as given by the children will provide the “choice”
information that the teacher wants.


FIGURE 10-1 Sociogram for 21 Kindergartners, 13 Boys and 8 Girls

From Northway, Mary L., and Weld, Lindsay, Sociometric Testing. Reproduced
by permission of the University of Toronto Press.
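
The tabulation described above is easily mechanized. The following Python sketch is an illustration only: the function name is invented, and the five pupils (A to E) and their choices are hypothetical, not the entries of Table 10-2.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set, Tuple


def tally_choices(choices: Dict[str, Tuple[str, str]]):
    """Summarize first and second choices made within a class.

    `choices` maps each chooser to (first choice, second choice).
    Returns counts of firsts and seconds received, the pairs whose first
    choices are mutual, and the children chosen by no one.
    """
    firsts = Counter(first for first, _ in choices.values())
    seconds = Counter(second for _, second in choices.values())
    mutual: Set[FrozenSet[str]] = {
        frozenset((child, first))
        for child, (first, _) in choices.items()
        if first in choices and choices[first][0] == child
    }
    unchosen: List[str] = [
        child for child in choices
        if firsts[child] == 0 and seconds[child] == 0
    ]
    return firsts, seconds, mutual, unchosen


# Hypothetical choices for five pupils (not the data of Table 10-2).
choices = {
    "A": ("B", "C"),
    "B": ("A", "C"),
    "C": ("A", "B"),
    "D": ("A", "B"),
    "E": ("C", "A"),
}
firsts, seconds, mutual, unchosen = tally_choices(choices)
print(firsts.most_common())
print(seconds.most_common())
print(mutual)
print(unchosen)
```

In this made-up class, A receives three first choices, A and B choose each other, and D and E are chosen by no one. These are the same kinds of facts that the summary rows and asterisks of a sociometric table bring out.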

A more striking method of representing the social relations
within the group is afforded by the pictorial sociogram shown in
Figure 10-1. The responses were those of twenty-one kindergarten
children, thirteen boys and eight girls. The stars (popular,
often chosen children) are quickly located as are also the
isolates, whom no one chooses. The two-headed arrows indicate
mutual choices.

When used wisely, a sociometric test can be helpful to a
teacher, especially when the class is too large for close personal
observation. Some of the things which a sociogram may reveal
are the following:

1. Good and bad personal relations, free interchange of
choices, or the existence of cliques.

2. Clusters and cleavages resulting from differences in race,
religion, sex, and economic conditions of families.

3. Differences between in-school and out-of-school social
groupings.

The sociometric method has some disadvantages and may do
more harm than good if the morale of the class is low because of
poor discipline, frequent change of teachers, or other disrupting
influences. For example, choices may be trivial or deliberately
false, or some pupils may take the “test” as an occasion to express
hostility and resentment against other pupils or against the
teacher. Moreover, the choices of young children are often
fleeting, vary from time to time, and are quite unreliable. The
sociogram, therefore, is not foolproof. At the same time, in the hands
of a skillful teacher, sociometric testing will often provide new
insights into the personality traits of pupils and thus aid in
discipline and in remedial work.


APPENDIX A 


STATISTICAL SUPPLEMENT 


In order to understand and use test results wisely, a teacher
should be familiar with those statistical concepts most often
employed in mental testing. One of the best ways to accomplish
this is to work through the computation of the basic statistics. In 
Chapter 2 a number of statistical terms were defined and their 
application illustrated. In subsequent chapters these statistics have 
been frequently employed. If, when a statistic is first mentioned, 
the student will work through its derivation—for example, the 
tabulation of a frequency distribution or the computation of 
an r—the value of the statistic to mental testing will be clarified. 
A second or even a third review is often helpful. A good analogy 
here is the habit of looking up unfamiliar words in a dictionary. 
Sometimes a word must be looked up more than once before its
meaning is clearly grasped.

This Appendix deals with the following topics:


The Frequency Distribution 

The Frequency Polygon and the Histogram 
Averages: Mean, Median, and Mode 

Measures of Variability: Range, Q, and SD (σ)
The Coefficient of Correlation 


Drawing up a Frequency Distribution 


Test scores are more readily dealt with when they have first
been organized into a frequency distribution. Suppose that Miss
Norton has administered a standard test to her class of forty
pupils in social studies, and that scores are as follows:


37, 38, 36, 31, 28, 33, 24, 19, 25, 34 
16, 43, 22, 20, 26, 44, 27, 19, 25, 34 
33, 24, 22, 20, 44, 27, 31, 28, 38, 17 
31, 26, 34, 17, 19, 20, 22, 24, 26, 29 


Table A-1 shows these forty scores tabulated into a frequency 
distribution in which the interval is five score units. Steps in 
setting up a frequency distribution follow: 


TABLE A-1
Frequency Distribution of Forty Scores on a Social Studies Test


Intervals    Midpoints    Tallies          f
40 - 44         42        |||              3
35 - 39         37        ||||             4
30 - 34         32        ||||| |||        8
25 - 29         27        ||||| |||||     10
20 - 24         22        ||||| ||||       9
15 - 19         17        ||||| |          6

                                      N = 40


(1) Determine the range, or the gap between the highest
and lowest scores. Examining our set of forty scores, we find the
range to be from 44 to 16, or 28.

(2) Select an interval which will be convenient for tabulation.
A good working rule is to take a grouping unit which will
yield from five to fifteen intervals. This rule may have to be
broken when the sample is very large (200 or 300, say) or very
small (less than 25).

(3) Divide the range by the interval size tentatively chosen.
This gives the approximate (within one) number of intervals.
In Table A-1 the range of 28 divided by five gives 5.6, and the
number of intervals is six. Five is a better choice than is a larger
or smaller unit. For example, an interval of three will spread the
data out too thin (into ten intervals), whereas an interval of ten
crowds all forty scores into three intervals.

(4) Write the beginning and end of each interval as a score: 
for example, 15-19. Actually a score of 15 represents the interval 
from 14.5 to 15.5—that is, a distance along an ability scale; and 
19 represents the interval 18.5 to 19.5. Hence, the lowest interval
begins at 14.5 and ends at 19.5; the second interval begins at 19.5 
and ends at 24.5, and so on. Writing score limits instead of actual 
limits saves time and avoids the confusion which often arises 
when one interval ends and the next begins with the same figure. 

(5) Tally each score under its proper interval as shown in 
Table A-1. Then write the sum of the tallies opposite each 
interval under f (frequency). Sum the f’s to give N. 

Note that the midpoint of the topmost interval is 42—that is, 
2.5 from 39.5 and 2.5 from 44.5. The midpoints have been 
entered in the second column. When scores have been arranged 
into a frequency distribution, all of the f’s within a given interval 
are represented by the midpoint of that interval. 
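
The five steps above can be sketched in a few lines of Python. This is only an illustration, not a procedure taken from the text: the function name `frequency_distribution` and its `start` and `width` parameters are assumptions made for the example, and the lowest score limit (15) and the interval size (5) are supplied by hand, as in Table A-1.

```python
from collections import Counter

# The forty social studies scores listed above.
scores = [
    37, 38, 36, 31, 28, 33, 24, 19, 25, 34,
    16, 43, 22, 20, 26, 44, 27, 19, 25, 34,
    33, 24, 22, 20, 44, 27, 31, 28, 38, 17,
    31, 26, 34, 17, 19, 20, 22, 24, 26, 29,
]

def frequency_distribution(scores, start, width):
    """Tally scores into intervals of `width` score units beginning at `start`.

    Returns rows of (low score limit, high score limit, midpoint, f),
    highest interval first, as in Table A-1. Hypothetical helper.
    """
    counts = Counter((s - start) // width for s in scores)
    rows = []
    for k in range(max(counts), -1, -1):
        low = start + k * width
        high = low + width - 1
        midpoint = (low + high) / 2          # e.g. (40 + 44) / 2 = 42
        rows.append((low, high, midpoint, counts.get(k, 0)))
    return rows

score_range = max(scores) - min(scores)      # step (1): 44 - 16 = 28
table = frequency_distribution(scores, start=15, width=5)

for low, high, midpoint, f in table:
    print(f"{low}-{high}  {midpoint:g}  {f}")
print("N =", sum(row[3] for row in table))
```

The f column printed by this sketch should agree with Table A-1 (3, 4, 8, 10, 9, 6), and the f's should sum to N = 40.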


The Frequency Polygon 


Figure A-1 shows the frequency polygon of the forty scores 
tabulated into a frequency distribution in Table A-1. Two axes, 
a horizontal or X-axis and a vertical or Y-axis, are drawn at right 
angles. Score intervals are laid off at regular distances along the 
X-axis, or baseline, beginning with 15, the lower limit of the 
first interval. The six scores on the lowest interval are represented 
by a point six units up on the Y-axis and just above 17, the mid- 
point of interval 15-19. The nine scores on the next interval 
are represented by a point nine units up on Y and just above 22,
midpoint of the interval 20-24. The other f’s are drawn in 
in the same manner. 

When all of the points are joined by short straight lines, we
have the outline of the frequency polygon. To complete the
figure (that is, to bring it down to the baseline at each end),
two intervals are added, one (10-14) at the low end and the other
(45-49) at the high end. The f on each of these intervals may
be taken as 0, and hence the frequency polygon reaches the
X-axis at 12 and 47.

FIGURE A-1 Frequency Distribution of the Forty Scores in
Table A-1
[Figure: frequency polygon. X-axis: Scores (10 to 50). Y-axis: Frequencies.]
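
As a rough sketch of the construction just described, the points of the polygon can be listed from the frequency distribution. The function name `polygon_points` is an assumption made for this example; it places each f above the midpoint of its interval and adds one empty interval at each end so that the figure returns to the baseline.

```python
# (score-limit interval, f) pairs from Table A-1, lowest interval first.
distribution = [((15, 19), 6), ((20, 24), 9), ((25, 29), 10),
                ((30, 34), 8), ((35, 39), 4), ((40, 44), 3)]

def polygon_points(distribution):
    """Return (midpoint, f) vertices, closed with a zero point at each end.

    Hypothetical helper: assumes all intervals have the same width.
    """
    first_low, first_high = distribution[0][0]
    width = first_high - first_low + 1                     # 19 - 15 + 1 = 5
    points = [((low + high) / 2, f) for (low, high), f in distribution]
    first_mid, last_mid = points[0][0], points[-1][0]
    # Midpoints of the added intervals 10-14 and 45-49, each with f = 0.
    return [(first_mid - width, 0)] + points + [(last_mid + width, 0)]

for x, y in polygon_points(distribution):
    print(x, y)
```

The first and last points fall at 12 and 47 on the baseline, the midpoints of the two added intervals.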


In order to provide a symmetrical figure, one which is neither
too squat nor too thin, units in X and Y must be carefully
chosen. A good rule is to select units which will make the height
of the figure about 2/3 of its length. In Figure A-1 the maximum
f (10) is about 2/3 the baseline length of the polygon.


The Histogram


The frequency distribution of Table A-1 is again represented
in Figure A-2, this time by a histogram, or column diagram.
The main difference between the frequency polygon and the
histogram is that in the histogram the f’s are represented by
small rectangles whose height equals the f’s on the intervals. In
Figure A-2, for example, the height of the first rectangle is 6,
its width being the length of the interval 14.5 to 19.5. Each
frequency rectangle begins at the actual lower limit of the inter-
val and ends at the actual upper limit. The histogram presents
the same facts as the frequency polygon and there is often little
to choose between them. When two or more frequency distribu-
tions are represented on the same axes, however (as for example,
the scores of two classes or two sections of the same class), the
frequency polygon is to be preferred to the histogram, because
the vertical and horizontal lines in a histogram coincide and are
often difficult to disentangle.

FIGURE A-2 Histogram of the Forty Scores in Table A-1
[Figure: histogram. X-axis: Scores (15 to 45). Y-axis: Frequencies.]


COMPUTATION OF AVERAGES 


There are three averages in common use: the mean, the median
and the mode.


The Mean (M) 


We have defined the M on page 20 as the statistic found by 
dividing the sum of the scores by their number. When scores 
are put into a frequency distribution, the scores classified within 
any interval lose their identity and are represented by the mid-
point of that interval. This necessitates a slightly different pro-
cedure from that used with unorganized scores.




In Table A-2, section A, the midpoint of each interval is
multiplied or “weighted” by the frequency which lies opposite
it in the f column. This gives the fX column and the sum of
this column (1100 in Table A-2) divided by N (40) gives a
mean of 27.50. The formula is

$$M = \frac{\Sigma fX}{N}$$

TABLE A-2

Computation of the Mean from a Frequency Distribution
Data are the forty scores in Table A-1.


A. LONG METHOD

Intervals   Midpoints    f      fX
40 - 44        42        3     126
35 - 39        37        4     148
30 - 34        32        8     256
25 - 29        27       10     270
20 - 24        22        9     198
15 - 19        17        6     102
                     N = 40   1100

M = ΣfX / N = 1100 / 40 = 27.50


B. ASSUMED MEAN METHOD (SHORT METHOD)

Intervals   Midpoints    f     x’     fx’
40 - 44        42        3      3       9
35 - 39        37        4      2       8
30 - 34        32        8      1       8    (sum of positive fx’ = +25)
25 - 29        27       10      0       0
20 - 24        22        9     -1      -9
15 - 19        17        6     -2     -12    (sum of negative fx’ = -21)
                     N = 40          Σfx’ = 4

AM = 27.00     c = Σfx’ / N = 4/40 = .10     ci = .10 × 5 = .50
M = AM + ci = 27.00 + .50 = 27.50



where ΣfX is the sum of the products f × X and N is the
number of cases.

M can always be computed by the “Long Method” just de-
scribed, but it is generally computed by the Assumed Mean, or
“Short Method.” When N is large, the Short Method reduces
calculation and saves time. Moreover, the Short Method is man-
datory when standard deviations and coefficients of correlation
are later to be computed from the same data. Computation of M
by the Assumed Mean or Short Method is shown in Table A-2,
Section B. Steps are as follows:

(1) Assume a mean, called the AM, near the center of the 
frequency distribution and if possible on the interval having 
the largest f. In our example, the AM is taken at 27, midpoint of 
interval 25-29, and this interval also has the largest f. 

(2) In the column x’* lay off deviations from the AM of 27
in units of interval. The midpoint of interval 30-34 (that is, 32)
deviates five scores or one interval from 27; and the midpoint
of interval 35-39 deviates two intervals from 27, and so on.
Below the AM, the deviations of the midpoints of the two inter-
vals (22 and 17) are −1 and −2. The midpoint of the interval
25-29 (that is, 27) is the assumed mean, and 0 is entered in
the x’ column opposite this interval.

(3) Multiply each x’ by its f and enter the product in the fx’
column. The sum of this column is 4 (that is, 25 − 21), from
which the correction (c) is calculated. The formula is

$$c = \frac{\Sigma fx'}{N}$$

and c = 4/40 or .10 in our problem.

(4) Multiply c, the correction in units of interval, by the
length of the interval or i, to give ci, the correction in score units.
In our example, ci = .10 × 5 = .50.

(5) Add the correction, ci, to the AM to get M. In Table A-2,
section B, the M = 27.00 + .50, or 27.50, thus checking the
computation in A above.

* x’ denotes the deviation of a midpoint from the AM, expressed
in units of interval; that is, x’ = (Midpt. − AM) / i. Deviations
from M are denoted by x.
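
The two methods can be checked against each other with a short Python sketch. The function names `mean_long` and `mean_short` are assumptions made for this illustration; the figures are the midpoints and f's of Table A-2.

```python
# (midpoint, f) pairs from Table A-2, highest interval first.
rows = [(42, 3), (37, 4), (32, 8), (27, 10), (22, 9), (17, 6)]

def mean_long(rows):
    """Long Method: M = sum(f * X) / N, with X the interval midpoint."""
    n = sum(f for _, f in rows)
    return sum(f * x for x, f in rows) / n

def mean_short(rows, assumed_mean, interval):
    """Assumed Mean (Short) Method: M = AM + c * i."""
    n = sum(f for _, f in rows)
    # x' is the deviation of each midpoint from the AM in units of interval.
    sum_fx_prime = sum(f * (x - assumed_mean) / interval for x, f in rows)
    c = sum_fx_prime / n                 # correction in units of interval
    return assumed_mean + c * interval   # correction converted to score units

print(mean_long(rows))
print(mean_short(rows, assumed_mean=27, interval=5))
```

Both calls should give 27.5, whichever interval midpoint is taken as the AM; choosing the interval with the largest f merely keeps the hand arithmetic small.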


The Median, or Mdn 


The median is defined as that point in the distribution below
which and above which lie 50 per cent of the distribution. The
median is also described as the fiftieth percentile (P50) and the
second quartile (Q2). Computation of the median in a fre-
quency distribution is shown in Table A-3. (The Q, or quartile
deviation, which is found in the same way as the median, is
also computed in the table.) Steps are as follows:


TABLE A-3

Computation of the Median and Q from a
Frequency Distribution
Data are the forty scores in Table A-1.

Intervals     f     cum f
40 - 44       3
35 - 39       4
30 - 34       8
25 - 29      10      25
20 - 24       9      15
15 - 19       6       6
         N = 40

N/2 = 20     N/4 = 10     3N/4 = 30

By formula, Median = 24.5 + 5 (20 − 15)/10 = 27.0
By formula, Q3 = 29.5 + 5 (30 − 25)/8 = 32.63
By formula, Q1 = 19.5 + 5 (10 − 6)/9 = 21.72

Q = (32.63 − 21.72)/2 = 5.46


(1) Take 1/2 of N and count into the distribution from the
low end until the interval containing the median is reached. In
Table A-3, N/2 = 20, and counting into the distribution from
interval 15-19, we locate the median on interval 25-29. The
two lowest intervals contain 6 + 9 or 15 f’s, and it is clear from
this cumulated f that the twentieth score must fall on interval
25-29.

(2) Apply the following formula:

$$Mdn = l + i\left(\frac{N/2 - \text{cum } f_l}{f_m}\right)$$

in which
l = lower limit of interval on which Mdn lies
N/2 = 1/2 of the number of scores
cum f_l = sum of scores on intervals below l
f_m = frequency on the interval containing the Mdn
i = length of the interval

In our example in Table A-3, l = 24.5, lower limit of interval
containing Mdn; N/2 = 20; cum f_l = 15; f_m = 10; i = 5.
Substituting in the formula, we have

$$Mdn = 24.5 + 5\left(\frac{20 - 15}{10}\right) = 27.00$$

The median can be found by counting into the distribution from
either end, but it is generally easier to start at the low end.
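
The counting and interpolation above can be sketched as follows. The function name `grouped_median` is an assumption made for this example, and the sketch assumes integer scores, so that the actual lower limit of an interval lies .5 below its lowest score.

```python
# (lowest score of the interval, f), lowest interval first (Table A-3).
intervals = [(15, 6), (20, 9), (25, 10), (30, 8), (35, 4), (40, 3)]

def grouped_median(intervals, width):
    """Mdn = l + i * (N/2 - cum f_l) / f_m, counting from the low end."""
    n = sum(f for _, f in intervals)
    if n == 0:
        raise ValueError("no scores in the distribution")
    target = n / 2
    cum = 0                                  # scores on intervals below l
    for low, f in intervals:
        if f > 0 and cum + f >= target:
            lower_limit = low - 0.5          # actual lower limit, e.g. 24.5
            return lower_limit + width * (target - cum) / f
        cum += f

print(grouped_median(intervals, width=5))    # 24.5 + 5 * (20 - 15) / 10
```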


The Mode 


The mode is usually taken as the midpoint of the interval
which contains the largest f. In Table A-3, the mode is simply 27,
the midpoint of the interval 25-29. This “midpoint” mode is
often called the crude mode. The mode may be calculated more
accurately, but since it is usually a preliminary statistic it is
hardly worth while to do so.


COMPUTATION OF MEASURES OF VARIABILITY 


The means or medians of two distributions are often the same 
or nearly so, but the spread or scatter of the scores around the 
central point is quite different. One class, for example, may 
show the same mean but a much greater range of talent than 
another. Knowing the variability of performance within a class 
may be more useful than knowing its average or typical per- 
formance. 

There are three measures of variability, all of which are used
in mental testing: the range, the Q, and the SD (σ).


The Range 


The range is the gap between the smallest and largest scores.
The range is a useful statistic, but is often a rough measure. It
is least efficient when there are several outstanding scores,
either very large or very small. For example, suppose there is a
gap of 20 points between 75, the highest score, and 55, the score
next below it. Then if the lowest score in the set is 25, the single
outstandingly high score will increase the range from 30 to 50.
We had occasion to find the range in constructing the frequency
distribution (page 240).
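
The effect of a single outstanding score on the range can be seen in a small Python sketch. The score lists here are hypothetical; only the values 25, 55, and 75 come from the example in the text.

```python
def score_range(scores):
    """Range: the gap between the largest and smallest scores."""
    return max(scores) - min(scores)

# Hypothetical class whose scores run from 25 to 55.
typical = [25, 38, 41, 47, 55]
# The same class with one outstandingly high score of 75 added.
with_outlier = typical + [75]

print(score_range(typical))        # 55 - 25
print(score_range(with_outlier))   # 75 - 25
```

One added score moves the range from 30 to 50, although the other scores are unchanged.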


The Q, or Quartile Deviation 


Q, the quartile deviation, is defined as one-half the distance
between the seventy-fifth and twenty-fifth percentile points in
the distribution. To find these two percentiles, we must count
into the distribution as we do to find the median. In Table A-3,
for example, we count off 3/4 of N to get Q3 (the third quartile
or seventy-fifth percentile) and 1/4 of N to reach Q1 (the first
quartile or twenty-fifth percentile). The formula for Q3 is

$$Q_3 = l + i\left(\frac{3N/4 - \text{cum } f_l}{f_m}\right)$$

and the formula for Q1 is

$$Q_1 = l + i\left(\frac{N/4 - \text{cum } f_l}{f_m}\right)$$

in which
l = lower limit of the interval upon which the
quartile point falls
i = the interval
cum f_l = cumulated f’s up to the interval containing
the quartile wanted
f_m = f on the interval which contains the quartile


In Table A-3, 3/4 of N is 30. Counting into the distribution
from the low end, twenty-five scores take us to 29.5, lower
limit of 30-34, which is the interval containing Q3. The f on
this interval is 8. Substituting in the formula, we have

$$Q_3 = 29.5 + 5\left(\frac{30 - 25}{8}\right) = 32.63$$


To obtain Q1 we count off 1/4 of N or ten scores as shown
in Table A-3. Six scores take us to 19.5, lower limit of the
interval 20-24, the interval which contains Q1. The f on this
interval is 9. Substituting in the formula, we have

$$Q_1 = 19.5 + 5\left(\frac{10 - 6}{9}\right) = 21.72$$


From the two quartile points, Q3 and Q1, we find Q by sub-
stituting in the formula

$$Q = \frac{Q_3 - Q_1}{2}$$

and in our example, Q = (32.63 − 21.72)/2, or 5.46.
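
The same counting procedure serves for any percentile point, as the following Python sketch suggests. The function name `point_below` is an assumption made for this example. Note that the sketch carries the quartiles unrounded, which gives a Q of about 5.45; the 5.46 above results from rounding Q3 and Q1 to two decimals before subtracting.

```python
# (lowest score of the interval, f), lowest interval first (Table A-3).
intervals = [(15, 6), (20, 9), (25, 10), (30, 8), (35, 4), (40, 3)]

def point_below(intervals, width, fraction):
    """Score point below which `fraction` of the N cases lie.

    Hypothetical helper: l + i * (fraction * N - cum f_l) / f_m.
    """
    n = sum(f for _, f in intervals)
    target = fraction * n
    cum = 0                                  # f's cumulated below the interval
    for low, f in intervals:
        if f > 0 and cum + f >= target:
            return (low - 0.5) + width * (target - cum) / f
        cum += f
    raise ValueError("no scores in the distribution")

q1 = point_below(intervals, 5, 0.25)         # 19.5 + 5 * (10 - 6) / 9
q3 = point_below(intervals, 5, 0.75)         # 29.5 + 5 * (30 - 25) / 8
q = (q3 - q1) / 2

print(round(q1, 2), round(q3, 3), round(q, 2))
```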


The Standard Deviation, SD or σ (sigma)


The standard deviation, or σ, is a measure of variability com-
puted around the mean; hence it is usually calculated from the
same frequency distribution as the mean. SD, or σ, is the most
stable measure of variability within a group and so is regularly
used in research problems which involve correlation and in-
ference. The computation of σ from a set of ungrouped scores
was outlined on page 22. Calculation of the SD from a fre-
quency distribution requires a somewhat different procedure.
The method is illustrated in Table A-4 for the same forty scores
tabulated in Table A-1. Steps are as follows:


TABLE A-4
Computation of the Standard Deviation (σ) from a
Frequency Distribution
Data are the forty scores in Table A-1.

Intervals     f     x’     fx’    fx’²
40 - 44       3      3       9     27
35 - 39       4      2       8     16
30 - 34       8      1       8      8    (sum of positive fx’ = +25)
25 - 29      10      0       0      0
20 - 24       9     -1      -9      9
15 - 19       6     -2     -12     24    (sum of negative fx’ = -21)
         N = 40                    84

AM = 27.00     c = Σfx’ / N = 4/40 = .10     c² = .01

σ = i √(Σfx’²/N − c²) = 5 √(84/40 − .01) = 5 × 1.446
σ = 7.23


(1) Find the deviation (x’) of each midpoint from the AM,
as was done in Table A-2. Enter these figures as 1, 2, 3, 0,
−1, −2 (that is, in units of interval) in the x’ column.

(2) Multiply each x’ by its f to give the entries in the fx’
column.

(3) Multiply each x’ and its corresponding fx’ entry to give
the entries in the fx’² column. For example, x’ = 3 times fx’ = 9
gives 27 as the fx’² entry.

(4) Sum the fx’² column to give Σfx’².

(5) Compute the correction (c) as in Table A-2. Square c
to get c². Be sure that c is left in units of interval.

(6) Find σ from the following formula:

$$\sigma = i\sqrt{\frac{\Sigma fx'^2}{N} - c^2}$$

In our example, i = 5, Σfx’² = 84, N = 40 and c² = .01.
Substituting these values in the formula, we get σ = 7.23. It
will be clear that in computing σ we make use of the same
quantities used in finding the mean; only the Σfx’² is new.
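
The six steps may be sketched in Python as follows. The function name `sd_short` is an assumption made for this illustration; the midpoints and f's are those of Table A-4.

```python
import math

# (midpoint, f) pairs from Table A-4, highest interval first.
rows = [(42, 3), (37, 4), (32, 8), (27, 10), (22, 9), (17, 6)]

def sd_short(rows, assumed_mean, interval):
    """sigma = i * sqrt(sum(f * x'^2) / N - c^2), with x' in units of interval."""
    n = sum(f for _, f in rows)
    # Step (1): deviations of the midpoints from the AM in units of interval.
    devs = [((x - assumed_mean) / interval, f) for x, f in rows]
    # Steps (2) and (5): sum of fx' gives the correction c.
    c = sum(f * d for d, f in devs) / n
    # Steps (3) and (4): sum of fx'^2.
    sum_fx2 = sum(f * d * d for d, f in devs)
    # Step (6): convert back to score units by multiplying by i.
    return interval * math.sqrt(sum_fx2 / n - c * c)

sd = sd_short(rows, assumed_mean=27, interval=5)
print(round(sd, 2))   # 7.23
```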


CORRELATION 


Correlation (page 27) is the correspondence or relationship
between two sets of test scores. Degree of relationship is ex-
pressed by a coefficient of correlation (r) along a scale which
extends from −1.00 to +1.00 through .00. There are several
methods of computing correlation, of which the product-moment
method is the most often employed in dealing with test scores.
Calculation of a product-moment r is illustrated in Table A-5.

Table A-5 shows the computation of the correlation between
test scores in reading and arithmetic achieved by ten children
in the fifth grade. The sample is much too small to give an ade-
quate indication of the relationship between these two variables,
and our table must be taken as a much simplified illustration of
correlational method.

The coefficient of correlation in Table A-5 is .23, revealing a
positive but quite low relationship between the two tests. The
first test (reading) is designated X, and the second test (arith-
metic) is Y. Note that, in order to compute the correlation, we
must first find the deviation of each child’s X-score from Mx
and the deviation of his Y-score from My. Each deviation from
Mx (53) is entered in the x column, and each deviation from
My (21) is entered in the y column. Each x and y is then squared
and entered in the x² and y² columns, and the sums of these two
columns are found. In the last column (xy), the x and y devia-
tions of each pupil are multiplied with due regard for sign, and
the sum of the xy column is determined. Finally, the sum of the
xy column is divided by the square root of the product of the
Σx² and Σy² to give the coefficient of correlation. The formula is

$$r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \times \Sigma y^2}}$$


TABLE A-5

Correlation between Reading and Arithmetic in the
Fifth Grade
(N = 10)

          Reading  Arithmetic
Pupils      (X)       (Y)       x     y     x²    y²    xy
John         60        26       7     5     49    25    35
Carol        55        24       2     3      4     9     6
Ann          63        18      10    -3    100     9   -30
Betty        40        21     -13     0    169     0     0
Louise       52        17      -1    -4      1    16     4
Tom          61        20       8    -1     64     1    -8
Bill         43        15     -10    -6    100    36    60
Joan         56        25       3     4      9    16    12
Dick         44        23      -9     2     81     4   -18
Carl         56        21       3     0      9     0     0

            530       210

Σx² = 586     Σy² = 116     Σxy = 61
Mx = 53.0     My = 21.0

r = 61 / √(586 × 116) = .23
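
The deviation method of Table A-5 can be sketched in Python as a check on the arithmetic. The function name `product_moment_r` is an assumption made for this example. John’s arithmetic score is taken here as 26, the value consistent with his deviation of 5 from My = 21 and with the column sum of 210.

```python
import math

# Scores of the ten pupils in Table A-5, in the order listed.
reading    = [60, 55, 63, 40, 52, 61, 43, 56, 44, 56]   # X
arithmetic = [26, 24, 18, 21, 17, 20, 15, 25, 23, 21]   # Y

def product_moment_r(xs, ys):
    """r = sum(xy) / sqrt(sum(x^2) * sum(y^2)), x and y being deviations."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    x = [v - mx for v in xs]                 # deviations from Mx
    y = [v - my for v in ys]                 # deviations from My
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

r = product_moment_r(reading, arithmetic)
print(round(r, 2))   # 0.23
```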


The formula for r may be written in a number of ways. The
form selected for use will depend on the character of the data,
size of the sample, purpose of the experimenter, and other con-
siderations. Whenever N is more than about 50, the correlation
coefficient should be computed from a diagram (see references,
Chapter 2).


APPENDIX B 


PUBLISHERS OF MENTAL TESTS


Teachers who do much testing should write the publishers
below for their catalogs.

Bureau of Publications, Teachers College, Columbia University,
New York 27, New York.

California Test Bureau, 5916 Hollywood Boulevard, Los Angeles
28, California.

Educational Test Bureau, 720 Washington Avenue, S.E., Minne-
apolis 14, Minnesota.

Educational Testing Service, Cooperative Test Division, 20 Nas-
sau Street, Princeton, New Jersey.

Houghton Mifflin Company, 2 Park Street, Boston 7, Massachu-
setts.

Psychological Corporation, 304 East 45th Street, New York 17,
New York.

Public School Publishing Company, 509-513 North East Street,
Bloomington, Illinois.

Science Research Associates, Inc., 57 West Grand Avenue, Chi-
cago 10, Illinois.

C. H. Stoelting and Company, 424 North Homan Avenue, Chi-
cago 24, Illinois.

Stanford University Press, Stanford, California.

World Book Company, 313 Park Hill Avenue, Yonkers 5, New
York.


GLOSSARY 


achievement test A test designed to measure pupil performance
in some school subject.

age equivalent The chronological age assigned to an obtained
score on a test representing the typical (average) age correspond-
ing to the score. Example: reading age = 8-4.

age norm Typical performance on a test expressed in age equiv-
alents.

alternate forms of a test Equivalent or parallel forms of a test.

aptitude test A test designed to measure potential ability; spe-
cifically, a test to predict future success in a school subject or in
a vocation.

attitude test A test designed to measure likes or dislikes in a
given area. Example: attitude towards war.

battery A group of tests, often combined into a team, designed
to measure a variety of abilities or aptitudes.

biserial r A coefficient of correlation often used to measure the
discriminative power of an item in analysis.

central tendency A measure typical of a group of scores; a mean,
median or mode.

chronological age (C.A.) Life age expressed in years and months.
Thus, 10-4 means 10 years and 4 months.

completion items Test questions in which the examinee must fill
in blank spaces in a statement or sentence in order to complete
the meaning.

correlation The tendency for one test to be related (or unre-
lated) to another test.

criterion Any measure of performance with which a test is
compared in determining validity.



deviation IQ A standard score found by converting raw scores
into a distribution with a mean = 100 and a σ of 15 or 16 points.

diagnostic tests Tests designed to reveal pupils’ strengths and
weaknesses in school subjects.

discriminating power A test item which separates good from
poor students has discriminating power.

distractor An option in a multiple-choice test that is incorrect.

essay items Test items calling for a relatively free response.

evaluation Appraisal of a pupil’s performance; may include in-
school and out-of-school behaviors.

frequency distribution An arrangement of test scores into groups
in order of size.

grade equivalent The grade score assigned to a given obtained
score on a test. Example: A score of 42 on an achievement test
may have a grade equivalent of 6.5 (halfway through the sixth
grade).

graphic rating scale A rating device in which possession of a
given degree of some trait is indicated by a check along a line.

group test A test that may be administered to all members of a
group or class at the same time.

individual test A test administered to only one person at a time.

IQ (intelligence quotient) Originally, the ratio of mental age to
chronological age when mental age is obtained from an Age
Scale. Often used loosely to mean any set of scores with a mean
of 100. See deviation IQ.

intelligence tests Tests designed to measure intelligence, which
may be defined as mental alertness or ability to do well in school.

inventory A test or checklist of a person’s personal characteris-
tics, attitudes, or interests.

item A single question on a test.

item analysis The process of determining the difficulty and
validity of test items through statistical analysis.

matching items Test items in which the members of one list are
to be matched against the members of a second list.



mean The arithmetic average of a set of test scores. 

median The point that divides a frequency distribution of 
scores into two equal parts. 

mental age (MA) The age for which an obtained score on an 
intelligence test is average or typical. 

mode The score which occurs most often in a distribution. 
multiple-choice items Test items which call for the selection of 
a correct answer from among several options. 

normal probability curve A theoretical distribution curve which 
many distributions of test scores approximate. 

norms Average performances for various groups—expressed as 
age or grade equivalents for school children, as percentiles, and 
in other ways. 

objective test A test answered by checking or circling a number 
or letter. Example: True-false test items. 

options Responses from among which an examinee must make 
a selection. 

percentile rank (PR) The equivalent to an obtained score on a
scale of 100 points. Example: If a score of 86 has a percentile
rank (PR) of 63, we know that 63 per cent of the group scored
below 86.

personality test A test (often an inventory) designed to assess 
an individual’s personal and social behaviors. 

power test A test designed to measure level of performance
rather than speed.

profile A graphic device for representing an examinee’s scores
on several tests.

projective tests Devices for studying personality through the use
of ink blots, pictures, designs.

quartile deviation (Q) A measure of variability. Q equals one-
half of the range of the middle 50 per cent of scores.

questionnaire A systematic inventory of questions covering per-
sonality traits, attitudes, or interests.

readiness test A measure of a child’s readiness or maturity level.
Often used in reading.



reliability Consistency of test scores. 
reliability coefficient Correlation coefficient giving the self-corre- 
lation of a test. 


skewness The extent to which a distribution of scores is off 
center or biased. 


sociometry Measurement of interpersonal relations within a class
or other group.

split-half reliability Reliability coefficient found by splitting a
test into halves. The two parts of the test usually consist of odd-
and even-numbered items.

standard deviation (SD or σ) A measure of variability.

standard score A converted or derived score found by express- 
ing an obtained score as being so far above or below the mean 
in SD units. 


standardized tests Printed tests for which there are norms on
defined groups. Directions are carefully prescribed.

test-retest reliability The correlation between scores made on
the same test administered on two occasions.

T-score A normalized score.

true-false items Test items which the examinee is to mark as
true or false.

validity The degree to which a test measures what it purports
to measure. There are several sorts of validity.

z-score An obtained score expressed as a deviation from the
test mean in terms of σ. When z-scores are converted into a
frequency distribution with an assigned mean and σ, they are
called standard scores.


AUTHOR INDEX 


Anastasi, A., 12, 77, 128, 155, 182
Arthur, G. A., 78
Bean, K. L., 225
Cronbach, L. J., 77, 100, 155
Freeman, F. S., 12, 77, 100, 182
Garrett, H. E., 42
Gerberich, J. R., 128, 207
[name illegible], F. L., 100
Greene, E. B., 155
Greene, H. A. [page numbers illegible]
Hagen, E., 12, 42, 100
Jordan, A. M., 129, 182
Jorgenson, A. N., 128
Justman, J., 208
McNemar, Q., 78
Merrill, M. A., 78
Morgan, C. L., 208
Noll, V. H., 42, 100, 155, 225
Remmers, H. H., 208
Robbins, I., 208
Ross, C. C., 12, 42, 208, 225
Ryden, E. R., 208
Stanley, J. C., 12, 42, 208, 225
Terman, L. M., 78
Thorndike, R. L., 12, 42, 100
Travers, R. M., 182, 208, 225
Traxler, A. E., 129
Wechsler, D., 78
Wrightstone, J. W., 208


SUBJECT INDEX 


Ability, meaning of, 46 

Achievement tests, 106; definition of, 
102; diagnostic uses of, 116; survey,
106; teacher-made, 210; value of, 115 

Adjustment inventories, 164 

Age norms, 41 

Age scale, 32; value of, 33 

American Council on Education Psy- 
chological Examination (ACE), 91 

Aptitudes, meaning of, 130 

Aptitude tests, art, 151; batteries of, 
133; case studies of the use of, 226; 
clerical, 137; how to judge, 154; in- 
terpreting scores in, 155; mechanical, 
133; music, 151; use in professional 
schools, 147 

Army Alpha test, 8, 81 

Army Beta test, 7 

Army General Classification Test 
(AGCT), 8, 81 

Art aptitude tests, 151 

Arthur Point Scale, 73; use of, in 
schools, 75 

Ascendance-Submission Reaction Study
(Allport), 171 


Attitudes, 171; questionnaires in the 
study of, 172 ff. 


Bell Adjustment Inventory, 169 

Bennett Mechanical Comprehension 
Test, 135 

Binet, Alfred, 6; characteristics of his 
tests, 6-7 


Biserial r, in item analysis, 215-216 


California Achievement Tests, 111; 
characteristics of, 113 

California Arithmetic Test, 212 

California Test of Mental Maturity,
84; description of, 85-87

California Test of Personality, 166-167



Central tendency, meaning of, 19 

Clerical aptitude tests, 137

Columbia Research Bureau Spanish 
Test, 211 

Combining test scores, 34-37

Completion-test items, 202; illustrations 
of, 203-204 

Content analysis, 31, 126, 213 

Cooperative French Test, 123

Cooperative General Achievement 
Tests, 113-114 

Cooperative Mathematics Test, 122 

Cooperative Science Test, 125 

Correction for guessing, 191; when to 
use, 194 

Correlation, meaning of, 27-28

Correlation coefficient, computation of, 
251-252 

Criteria, in validity, 154 


Diagnostic tests, clinical, 57-59, 67-68; 
differential, 140-144; educational, 52- 
56, 68, 70-72, 75-77, 93-95

Diagnostic Tests of Achievement in 
Music, 152-153 

Differential Aptitude Tests (DAT), 
140-144 

Educational achievement tests, 102- 
103; and intelligence tests, 103; com- 
pared with school examinations, 104- 
106; in school subjects, 118; how 
used in schools, 115-118; what to 
look for in, 125-128 

Educational age (EA), 127 

Essay tests, described, 204-205; how to 
improve, 205-206; scoring in, 206- 
207 

Evaluation and Adjustment Series, 122 


Frequency distribution, 14-15; rules for 
constructing, 239-241 




Frequency polygon, 15; how to con- 
struct, 241-242 


Galton, Francis, role in testing move- 
ment, 5-6 

General Clerical Test, 139 

Gordon’s Personal Profile and Personal 
Inventory, 168-169 

Grade norms, 41 

Group tests, of intelligence, 80-82; in 
guidance, 93-95; norms in, 98; relia- 
bility of, 97; scaling in, 97; use in 
schools, 92-96 

Guidance, educational and vocational, 
93-95, 115-117, 229-233 


Halo effect in ratings, 162 
Histogram, 16-17; how to construct, 
242-243 


Individual differences, importance of, 
3; 12 

Intelligence, meaning of, 45; levels of, 
46-47 

Intelligence quotient (IQ), Stanford-
Binet, 51-52; constancy of, 61-63; dis-
tribution of, 53-54; precautions in
interpreting, 60-61; stability of, 56-
57

Intelligence quotient (IQ), Wechsler-
Bellevue, 65-67; in diagnosis, 67-68; 
range of, 67 

Intelligence tests, factors in the choice 
of, 96-100; group, 80-81; individual, 
44-45; performance, 72-75 

Interest inventories, 174-180 

Iowa Silent Reading Tests, 121-122 

IQ (intelligence quotient), 33; as 
standard score, 39; as ratio, 52. See 
Intelligence Quotient 

Item analysis, 214-221; short method 
of, 219-221 

Item (test), difficulty of, 213; selection 
of, 212-213; validity of, 214 


Kuder Preference Record, 177-179 
Kuhlmann-Anderson Intelligence 
Tests, 88-89 


Law School Admission Test, 149 


MacQuarrie Test of Mechanical Abil- 
ity, 134-135 




Matching items, 199; illustrations of, 
200-202 

Mean, 20; in frequency distribution, 
243-246 

Mechanical aptitude tests, 133-137 

Median, 20; in frequency distribution, 
246-247 

Medical College Admission Test, 148- 
149 

Meier Art Judgment Test, 153-154

Mental age (MA), 32-33 

Mental tests, classification of, 3-5; com-
pared with physical, 2-3; history of,
5-17; uses of, in schools, 12 

Metropolitan Achievement Tests, 109- 
111 

Metropolitan Readiness Tests, 118-120 

Minnesota Clerical Test, 137-139 

Minnesota Paper Formboard Test, 144-
145

Mode, 20-21, 247-248 

Multiple-choice items, 193-194; illus- 
trations of, 195-197 

Multiple response items, 198-199 

Murphy-Durrell Diagnostic Reading 
Readiness Test, 145-146 

Musical aptitude tests, 151-153 

National Teacher Examination, 150 

Nelson-Denny Reading Test, 212 

Normal distribution, 17; uses of, in 
testing, 17-19 

Normal probability curve, 17; areas 
under, 23 

Norms, 40; age, 33; percentile, 35;
standard scores as, 36-38 

Objectives, educational, 105-106 

Objective tests, 80, 105; compared with 
essay examinations, 185-189; item
types in, 185 

Occupational Interest Inventory, 175-
177

Orleans Algebra Prognosis Test, 146- 
147 

Otis Quick-Scoring Mental Ability 
Tests, 87-88 


Percentile rank, 25-27; advantages of, 
33-36; limitations of, 36; norms in 
terms of, 35 

Percentile scale, 33-36 

Performance tests, 72-75 




Personality, meaning of, 157-158; in- 
ventories in the measurement of, 
164; rating scales in the measure- 
ment of, 158-163; sociometric tech- 
niques in, 233-237 

Personality inventories, 164-180; sum- 
mary on the use of, 180-182 

Pintner’s Aspects of Personality, 167- 
168 

Pintner-Cunningham Primary Tests, 
82-84 

Pre-Engineering Ability Tests, 149-150 

Profiles, use of, in comparing test re- 
sults, 35, 85, 143 

Projective tests, meaning of, 158 


Quartile, meaning of, 24-25 

Quartile deviation (Q), calculation of, 
248-249 

Questionnaires, 164 


r, coefficient of correlation, 27-28; cal-
culation of, 251-252

Range of scores, 14, 248 

Rating scales, 158-160; factors affect- 
ing, 160-162; improvement in, 162- 
163; summary on, 163 

Reliability of a test, coefficient of, 27-
29; parallel forms in, 29; split-half
technique in, 223-224; test-retest in,
29


Seashore Measures of Musical Talent, 
151-152 

Selection of tests, factors in, 96-100,
125-128, 154-158

Sequential Tests of Educational Prog-
ress, 114-115

Sigma Scores, meaning of, 36

Sociometric techniques, 233-237

Standard deviation, 22; calculation of,
in simple series, 22; in a frequency
distribution, 249-251



Standard error, of a score, 29-30 
Standard scores, 36; computation of, 
36-38; normalized or T-scores, 40 

Standardized tests, 210-211 

Stanford Achievement Tests, 106-109 

Stanford-Binet Intelligence Scale, 47- 
52; reliability of, 56-57; scoring in, 
51; uses of, in the schools, 52-60; 
validity of, 61-63 

Strong Vocational Interest Blank, 179- 
180 

Study of Values (Allport-Vernon-
Lindzey), 172-173


Teacher-made tests, 219 ff. 

Terman-McNemar Test of Mental 
Ability, 89-91 

Test items, varieties of, 185 ff. 

Thurstone Temperament Schedule, 
170-171

True-False items, 189-190; illustrations 
of, 192-193 

T-score, 40 

Turse Shorthand Aptitude Test, 147 


Validity, of a test, 30-31; of test items, 
214 

Variability, in scores, 21

Verbal ability, and performance abil- 
ity, 64-65, 68, 75-77 


Wechsler-Bellevue Intelligence Scale, 
6? 67; in diagnosis, 68; in the schools, 
67-68 

Wechsler Adult Intelligence Scale, 63 

Wechsler Intelligence Scale for Chil- 
dren, 68-69; compared with Stan- 
ford-Binet, 70; MA in, 72; range and 
stability of IQ’s in, 71


Z-score, 36 

